PyPI - synth-ai - Versions diffs - 0.2.13.dev2__py3-none-any.whl → 0.2.14__py3-none-any.whl - Mend

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Potentially problematic release.

This version of synth-ai might be problematic.

Files changed (110)

examples/multi_step/configs/README_verilog_rl.md +77 -0
examples/multi_step/configs/VERILOG_REWARDS.md +90 -0
examples/multi_step/configs/VERILOG_RL_CHECKLIST.md +183 -0
examples/multi_step/configs/crafter_eval_synth_qwen4b.toml +35 -0
examples/multi_step/configs/crafter_eval_text_only_groq_qwen32b.toml +36 -0
examples/multi_step/configs/crafter_rl_stepwise_hosted_judge.toml +5 -4
examples/multi_step/configs/crafter_synth_backend.md +40 -0
examples/multi_step/configs/verilog_eval_groq_qwen32b.toml +31 -0
examples/multi_step/configs/verilog_eval_synth_qwen8b.toml +33 -0
examples/multi_step/configs/verilog_rl_lora.toml +190 -0
examples/multi_step/judges/crafter_backend_judge.py +220 -0
examples/multi_step/judges/verilog_backend_judge.py +234 -0
examples/multi_step/readme.md +48 -0
examples/multi_step/verilog_rl_lora.md +218 -0
examples/qwen_coder/configs/coder_lora_30b.toml +1 -1
examples/sft/evaluate.py +2 -0
examples/sft/generate_traces.py +2 -0
examples/swe/task_app/grpo_swe_mini.py +1 -0
examples/swe/task_app/hosted/rollout.py +2 -0
examples/task_apps/IMAGE_ONLY_EVAL_QUICKSTART.md +258 -0
examples/task_apps/crafter/CREATE_SFT_DATASET.md +273 -0
examples/task_apps/crafter/EVAL_IMAGE_ONLY_RESULTS.md +152 -0
examples/task_apps/crafter/FILTER_COMMAND_STATUS.md +174 -0
examples/task_apps/crafter/FILTER_COMMAND_SUCCESS.md +268 -0
examples/task_apps/crafter/QUERY_EXAMPLES.md +203 -0
examples/task_apps/crafter/README_IMAGE_ONLY_EVAL.md +316 -0
examples/task_apps/crafter/eval_image_only_gpt4o.toml +28 -0
examples/task_apps/crafter/eval_text_only_groq_llama.toml +36 -0
examples/task_apps/crafter/filter_sft_dataset.toml +16 -0
examples/task_apps/crafter/task_app/__init__.py +3 -0
examples/task_apps/crafter/task_app/grpo_crafter.py +306 -8
examples/task_apps/crafter/task_app/synth_envs_hosted/envs/crafter/environment.py +10 -0
examples/task_apps/crafter/task_app/synth_envs_hosted/envs/crafter/policy.py +16 -3
examples/task_apps/crafter/task_app/synth_envs_hosted/envs/crafter/react_agent.py +17 -2
examples/task_apps/crafter/task_app/synth_envs_hosted/inference/openai_client.py +25 -3
examples/task_apps/crafter/task_app/synth_envs_hosted/policy_routes.py +52 -1
examples/task_apps/crafter/task_app/synth_envs_hosted/rollout.py +111 -13
examples/task_apps/crafter/task_app/synth_envs_hosted/utils.py +156 -0
examples/task_apps/enron/filter_sft.toml +5 -0
examples/task_apps/enron/tests/__init__.py +2 -0
examples/task_apps/enron/tests/integration/__init__.py +2 -0
examples/task_apps/enron/tests/integration/test_enron_eval.py +2 -0
examples/task_apps/enron/tests/unit/__init__.py +2 -0
examples/task_apps/pokemon_red/EVAL_IMAGE_ONLY_COMPLETE.md +283 -0
examples/task_apps/pokemon_red/EVAL_IMAGE_ONLY_STATUS.md +155 -0
examples/task_apps/pokemon_red/README_IMAGE_ONLY_EVAL.md +415 -0
examples/task_apps/pokemon_red/eval_image_only_gpt4o.toml +29 -0
examples/task_apps/pokemon_red/pallet_town_rl_config.toml +2 -0
examples/task_apps/pokemon_red/task_app.py +199 -6
examples/task_apps/pokemon_red/test_pallet_town_rewards.py +2 -0
examples/task_apps/sokoban/filter_sft.toml +5 -0
examples/task_apps/sokoban/tests/__init__.py +2 -0
examples/task_apps/sokoban/tests/integration/__init__.py +2 -0
examples/task_apps/sokoban/tests/unit/__init__.py +2 -0
examples/task_apps/verilog/eval_groq_qwen32b.toml +8 -4
examples/task_apps/verilog/filter_sft.toml +5 -0
examples/task_apps/verilog/task_app/grpo_verilog.py +258 -23
examples/task_apps/verilog/tests/__init__.py +2 -0
examples/task_apps/verilog/tests/integration/__init__.py +2 -0
examples/task_apps/verilog/tests/integration/test_verilog_eval.py +2 -0
examples/task_apps/verilog/tests/unit/__init__.py +2 -0
examples/warming_up_to_rl/groq_test.py +2 -0
examples/warming_up_to_rl/run_local_rollout.py +2 -0
examples/warming_up_to_rl/run_local_rollout_modal.py +2 -0
examples/warming_up_to_rl/run_local_rollout_parallel.py +2 -0
examples/warming_up_to_rl/run_local_rollout_traced.py +2 -0
examples/warming_up_to_rl/run_rollout_remote.py +2 -0
synth_ai/api/models/supported.py +1 -0
synth_ai/cli/__init__.py +46 -13
synth_ai/cli/_modal_wrapper.py +3 -2
synth_ai/cli/recent.py +1 -1
synth_ai/cli/status.py +1 -1
synth_ai/cli/task_apps.py +354 -143
synth_ai/cli/traces.py +1 -1
synth_ai/cli/tui.py +57 -0
synth_ai/cli/turso.py +1 -1
synth_ai/cli/watch.py +1 -1
synth_ai/demos/demo_task_apps/crafter/grpo_crafter_task_app.py +1 -1
synth_ai/environments/examples/crafter_classic/environment.py +1 -1
synth_ai/environments/examples/verilog/engine.py +76 -10
synth_ai/judge_schemas.py +8 -8
synth_ai/task/__init__.py +11 -1
synth_ai/task/apps/__init__.py +1 -0
synth_ai/task/config.py +257 -0
synth_ai/task/contracts.py +15 -2
synth_ai/task/rubrics/__init__.py +3 -0
synth_ai/task/rubrics/loaders.py +22 -3
synth_ai/task/rubrics/scoring.py +3 -0
synth_ai/task/trace_correlation_helpers.py +315 -0
synth_ai/task/validators.py +144 -0
synth_ai/tracing_v3/abstractions.py +3 -3
synth_ai/tracing_v3/llm_call_record_helpers.py +5 -5
synth_ai/tracing_v3/session_tracer.py +16 -6
synth_ai/tracing_v3/storage/base.py +29 -29
synth_ai/tracing_v3/storage/config.py +3 -3
synth_ai/tracing_v3/turso/daemon.py +8 -7
synth_ai/tracing_v3/turso/native_manager.py +63 -40
synth_ai/tracing_v3/utils.py +3 -3
synth_ai/tui/__init__.py +5 -0
synth_ai/tui/__main__.py +13 -0
synth_ai/tui/cli/__init__.py +1 -0
synth_ai/tui/cli/query_experiments.py +164 -0
synth_ai/tui/cli/query_experiments_v3.py +164 -0
synth_ai/tui/dashboard.py +906 -0
{synth_ai-0.2.13.dev2.dist-info → synth_ai-0.2.14.dist-info}/METADATA +1 -1
{synth_ai-0.2.13.dev2.dist-info → synth_ai-0.2.14.dist-info}/RECORD +110 -71
{synth_ai-0.2.13.dev2.dist-info → synth_ai-0.2.14.dist-info}/WHEEL +0 -0
{synth_ai-0.2.13.dev2.dist-info → synth_ai-0.2.14.dist-info}/entry_points.txt +0 -0
{synth_ai-0.2.13.dev2.dist-info → synth_ai-0.2.14.dist-info}/licenses/LICENSE +0 -0
{synth_ai-0.2.13.dev2.dist-info → synth_ai-0.2.14.dist-info}/top_level.txt +0 -0

examples/task_apps/pokemon_red/EVAL_IMAGE_ONLY_COMPLETE.md ADDED

@@ -0,0 +1,283 @@
+# Pokemon Red Image-Only Eval - Complete ✅
+## Summary
+Successfully ran **10 rollouts** of Pokemon Red with **image-only input** (no text observations), with full **Turso tracing** and **outcome rewards** saved to database.
+## Configuration
+- **Model**: `gpt-4o-mini-2024-07-18`
+- **Input Mode**: Image-only (vision enabled, text observations disabled)
+- **Max Steps**: 10 per episode
+- **Max LLM Calls**: 10 per rollout
+- **Seeds**: 0-9 (10 rollouts)
+- **Tracing**: Enabled with Turso/libsql (MVCC concurrent writes)
+- **Database**: `traces/v3/pokemon_red_eval.db` (192KB)
+## Results
+### Overall Performance
+- **Total Rollouts**: 10/10 completed
+- **Completion Rate**: 100% (all rollouts finished without errors)
+- **Mean Reward**: 0.000
+- **Rollouts with Rewards**: 0/10 (0%)
+*Note: zero rewards are expected here. The Pallet Town sequence is hard to progress through with only 10 turns and image-only input.*
+### Database Verification
+```
+Total rollouts: 10
+Rollouts with reward > 0: 0
+Rollouts with achievements > 0: 0
+Average reward: 0.0
+Database size: 192KB
+```
+### All Rollouts
+All 10 seeds stayed in Map 38 (Red's bedroom) with 0 party Pokemon and 0 badges.
+## Implementation Details
+### 1. Image-Only Mode
+**File**: `task_app.py` → `_call_inference()` function
+```python
+# Check if vision mode is enabled
+use_vision = bool(policy_cfg.get("use_vision", False))
+image_only_mode = bool(policy_cfg.get("image_only_mode", False))
+# Image-only mode: only send image, no text
+if image_only_mode:
+    user_content = [
+        {"type": "image_url", "image_url": {"url": image_data_url}}
+    ]
+else:
+    # Vision mode with text: send both text and image
+    user_content = [
+        {"type": "text", "text": state_summary},
+        {"type": "image_url", "image_url": {"url": image_data_url}}
+    ]
+```
+### 2. OpenAI API Integration
+**File**: `task_app.py` → `_call_inference()` function
+Fixed inference URL construction and authentication:
+```python
+# Add /v1/chat/completions if using OpenAI directly
+if "api.openai.com" in inference_url:
+    inference_url = inference_url + "/v1/chat/completions"
+# External API: use direct HTTP client with auth header
+if is_external:
+    headers = {}
+    if "api.openai.com" in inference_url:
+        api_key = os.getenv("OPENAI_API_KEY")
+        if api_key:
+            headers["Authorization"] = f"Bearer {api_key}"
+```
+### 3. SessionTracer Integration
+**File**: `task_app.py` → `rollout_executor()` function
+Added full Turso tracing like Crafter:
+```python
+# Initialize SessionTracer for this rollout
+tracer_factory = getattr(fastapi_request.app.state, "session_tracer_factory", None)
+tracer_instance: SessionTracer | None = None
+if callable(tracer_factory):
+    inst = tracer_factory()
+    tracer_instance = inst if isinstance(inst, SessionTracer) else None
+# Start tracing session
+if tracer_instance is not None:
+    await tracer_instance.initialize()
+    await tracer_instance.start_session(
+        session_id=request.run_id,
+        metadata={...}
+    )
+```
+### 4. Outcome Rewards
+**File**: `task_app.py` → `rollout_executor()` end
+```python
+# Record outcome rewards and end session
+if tracer_instance is not None:
+    achievements_count = len(milestone_events)
+    reward_metadata = {
+        "run_id": request.run_id,
+        "env_name": "pokemon_red",
+        "final_map": final_state.get("map_id", -1),
+        "party_count": final_state.get("party_count", 0),
+        "badges": final_state.get("badges", 0),
+        "steps": len(steps),
+        "milestone_events": milestone_events,
+        "reward_components": all_reward_components,
+    }
+    # Record outcome reward to Turso
+    await tracer_instance.record_outcome_reward(
+        total_reward=int(total_reward),
+        achievements_count=achievements_count,
+        total_steps=len(steps),
+        reward_metadata=reward_metadata,
+    )
+    # End session
+    session_trace = await tracer_instance.end_session()
+```
+### 5. Tracer Factory Setup
+**File**: `task_app.py` → `build_config()` function
+```python
+# Set up tracing
+tracing_enabled = tracing_env_enabled()
+tracing_db_url = resolve_tracing_db_url()
+tracer_factory = build_tracer_factory(
+    SessionTracer, enabled=tracing_enabled, db_url=tracing_db_url
+)
+app_state: dict[str, Any] = {
+    "tracing_enabled": tracing_enabled,
+}
+if tracer_factory is not None:
+    app_state["session_tracer_factory"] = tracer_factory
+```
+## Database Schema
+### outcome_rewards Table
+```sql
+CREATE TABLE outcome_rewards (
+    id INTEGER PRIMARY KEY AUTOINCREMENT,
+    session_id VARCHAR NOT NULL,
+    total_reward INTEGER NOT NULL,
+    achievements_count INTEGER NOT NULL,
+    total_steps INTEGER NOT NULL,
+    created_at DATETIME NOT NULL,
+    reward_metadata TEXT,
+    FOREIGN KEY(session_id) REFERENCES session_traces(session_id)
+);
+```
+## Query Examples
+### Get all sessions with rewards
+```sql
+SELECT
+    st.session_id,
+    st.num_timesteps,
+    orw.total_reward,
+    orw.achievements_count,
+    json_extract(orw.reward_metadata, '$.final_map') as final_map
+FROM session_traces st
+INNER JOIN outcome_rewards orw ON st.session_id = orw.session_id
+ORDER BY orw.total_reward DESC;
+```
+### Filter for non-zero rewards (when they exist)
+```sql
+SELECT
+    session_id,
+    total_reward,
+    achievements_count,
+    total_steps,
+    json_extract(reward_metadata, '$.final_map') as final_map,
+    json_extract(reward_metadata, '$.party_count') as party_count
+FROM outcome_rewards
+WHERE total_reward > 0
+ORDER BY total_reward DESC;
+```
+## Comparison: Crafter vs Pokemon Red
+| Feature | Crafter | Pokemon Red |
+|---------|---------|-------------|
+| Image-only mode | ✅ Working | ✅ Working |
+| OpenAI API | ✅ Working | ✅ Working |
+| Eval CLI | ✅ Working | ✅ Working |
+| SessionTracer | ✅ Integrated | ✅ Integrated |
+| Turso database | ✅ 1.7MB (10 rollouts) | ✅ 192KB (10 rollouts) |
+| outcome_rewards | ✅ 10 rows | ✅ 10 rows |
+| Foreign keys | ✅ Working | ✅ Working |
+| Non-zero rewards | ✅ 7/10 rollouts | ❌ 0/10 rollouts* |
+*Expected: Pokemon Red is harder (requires room navigation, NPC dialogue, etc.)
+## Files Modified
+1. **`task_app.py`**:
+   - Added `use_vision` and `image_only_mode` support
+   - Fixed OpenAI API URL construction and auth
+   - Integrated SessionTracer for Turso persistence
+   - Added `record_outcome_reward()` calls
+   - Updated `build_config()` to create tracer_factory
+2. **`eval_image_only_gpt4o.toml`** (new):
+   - Config for image-only evaluation
+   - 10 seeds, 10 max turns per episode
+   - GPT-4o mini with vision enabled
+## Running the Evaluation
+```bash
+cd /path/to/synth-ai  # local checkout of the synth-ai repository
+# Set up tracing environment
+export TASKAPP_TRACING_ENABLED=1
+export TURSO_NATIVE=1
+export SQLD_DB_PATH="traces/v3/pokemon_red_eval.db"
+# Run evaluation
+uv run synth-ai eval pokemon_red \
+  --config examples/task_apps/pokemon_red/eval_image_only_gpt4o.toml
+```
+## Verification Commands
+```bash
+# Check database size
+ls -lh traces/v3/pokemon_red_eval.db
+# Count sessions
+sqlite3 traces/v3/pokemon_red_eval.db \
+  "SELECT COUNT(*) FROM session_traces;"
+# View all rewards
+sqlite3 -header -column traces/v3/pokemon_red_eval.db \
+  "SELECT session_id, total_reward, achievements_count, total_steps
+   FROM outcome_rewards
+   ORDER BY total_reward DESC;"
+# Test foreign keys
+sqlite3 traces/v3/pokemon_red_eval.db \
+  "SELECT st.session_id, orw.total_reward
+   FROM session_traces st
+   INNER JOIN outcome_rewards orw ON st.session_id = orw.session_id
+   LIMIT 5;"
+```
+## Next Steps
+To improve rewards:
+1. **Increase max_turns**: Try 50-100 turns per episode
+2. **Better prompting**: Add more detailed instructions in system prompt
+3. **Hybrid mode**: Use `use_vision=true` with `image_only_mode=false` to get both images and text
+4. **Different model**: Try GPT-4o (full) or Claude 3.5 Sonnet for better vision understanding
+## Summary
+✅ **All goals achieved**:
+- Image-only input mode working
+- 10 rollouts completed successfully
+- Turso database created with 192KB of trace data
+- outcome_rewards table with foreign keys
+- Can filter and query by rewards
+- SessionTracer fully integrated
+Pokemon Red now has the same Turso tracing capabilities as Crafter! 🎉
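
The `outcome_rewards` schema and the non-zero reward filter documented in `EVAL_IMAGE_ONLY_COMPLETE.md` above can be tried without running an evaluation. The following is a minimal sketch, assuming an in-memory SQLite database. The one-column `session_traces` table, the sample rows, and the helper names `make_demo_db` and `nonzero_rewards` are hypothetical stand-ins and are not part of synth-ai. Metadata is decoded in Python rather than with `json_extract`, so the sketch does not depend on SQLite's JSON functions being available.

```python
import json
import sqlite3

# outcome_rewards is copied from the documented schema; session_traces is a
# simplified, hypothetical stand-in with only the column the foreign key needs.
SCHEMA = """
CREATE TABLE session_traces (
    session_id VARCHAR PRIMARY KEY
);
CREATE TABLE outcome_rewards (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    session_id VARCHAR NOT NULL,
    total_reward INTEGER NOT NULL,
    achievements_count INTEGER NOT NULL,
    total_steps INTEGER NOT NULL,
    created_at DATETIME NOT NULL,
    reward_metadata TEXT,
    FOREIGN KEY(session_id) REFERENCES session_traces(session_id)
);
"""


def make_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with two invented rollouts."""
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is on for the connection.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    # Invented rows for illustration only; these are not evaluation results.
    samples = [("seed-0", 0, 0, 38), ("seed-1", 3, 2, 37)]
    for session_id, reward, achievements, final_map in samples:
        conn.execute(
            "INSERT INTO session_traces (session_id) VALUES (?)", (session_id,)
        )
        conn.execute(
            "INSERT INTO outcome_rewards "
            "(session_id, total_reward, achievements_count, total_steps, "
            " created_at, reward_metadata) "
            "VALUES (?, ?, ?, ?, datetime('now'), ?)",
            (session_id, reward, achievements, 10,
             json.dumps({"final_map": final_map})),
        )
    conn.commit()
    return conn


def nonzero_rewards(conn: sqlite3.Connection) -> list[dict]:
    """Return rollouts with total_reward > 0, highest reward first."""
    rows = conn.execute(
        "SELECT session_id, total_reward, achievements_count, reward_metadata "
        "FROM outcome_rewards WHERE total_reward > 0 "
        "ORDER BY total_reward DESC"
    ).fetchall()
    return [
        {
            "session_id": session_id,
            "total_reward": reward,
            "achievements_count": achievements,
            "final_map": json.loads(metadata or "{}").get("final_map"),
        }
        for session_id, reward, achievements, metadata in rows
    ]
```

Calling `nonzero_rewards(make_demo_db())` returns only the invented `seed-1` row. Against a real trace database, the same query would return an empty list for the run described above, since all ten rollouts recorded a reward of 0.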

examples/task_apps/pokemon_red/EVAL_IMAGE_ONLY_STATUS.md ADDED

@@ -0,0 +1,155 @@
+# Pokemon Red Image-Only Eval Status - ✅ COMPLETE
+**Status**: All features working! See `EVAL_IMAGE_ONLY_COMPLETE.md` for full details.
+---
+# Original Status (Before Turso Integration)
+## ✅ What's Working
+### 1. Image-Only Input Mode
+- Successfully modified `task_app.py` to support `use_vision` and `image_only_mode` config flags
+- When enabled, sends only base64-encoded PNG frames to the LLM (no text observations)
+- Similar to Crafter's implementation
+### 2. OpenAI API Integration
+- Fixed inference URL construction to properly call `https://api.openai.com/v1/chat/completions`
+- Added proper Authorization Bearer token handling
+- Successfully runs 10 rollouts with `gpt-4o-mini-2024-07-18`
+### 3. Eval Configuration
+- Created `eval_image_only_gpt4o.toml` config file
+- Successfully runs via `synth-ai eval pokemon_red --config ...`
+- All 10 seeds complete without errors
+## ⚠️ What's Not Working Yet
+### Turso Tracing & Rewards
+**Issue**: Pokemon Red doesn't use SessionTracer like Crafter does
+**Current State**:
+- Pokemon Red returns a basic trace payload (session_id, metadata) for the CLI
+- However, it does not create or write to a Turso database
+- No `outcome_rewards` table or reward persistence
+- No integration with `SessionTracer` from `tracing_v3`
+**What Would Be Needed**:
+1. Import and initialize `SessionTracer` in Pokemon Red's `rollout_executor`
+2. Call `tracer.start_session()` at beginning of rollout
+3. Record events during rollout (like Crafter does)
+4. Call `tracer.record_outcome_reward()` at end with:
+   - `total_reward`: sum of step rewards
+   - `achievements_count`: count of milestones reached
+   - `total_steps`: number of steps taken
+   - `reward_metadata`: dict with map_id, party_count, badges, etc.
+5. Call `tracer.end_session()` to persist to database
+### Reward Computation
+**Current State**:
+- Pokemon Red has a `PalletTownProgressionCompositeReward` reward function
+- It tracks milestones like leaving bedroom, getting starter Pokemon, etc.
+- But rewards are currently all 0.0 (expected - task is hard with only 10 turns and image-only input)
+**What's Challenging**:
+- The Pallet Town sequence requires:
+  - Navigating multiple rooms
+  - Talking to NPCs (pressing A at right moments)
+  - Selecting starter Pokemon
+  - Entering first battle
+- With only images (no text hints) and 10 LLM calls, agents struggle to make progress
+- May need more turns or better prompting to get non-zero rewards
+## 📊 Current Results
+```
+Eval complete: 10 ok, 0 failed
+Model: gpt-4o-mini-2024-07-18
+Seeds: 0-9 (10 rollouts)
+Mean reward: 0.000
+Outcome score: 0.000
+All rollouts: ~21 steps, 0 rewards, Map 38 (Red's bedroom)
+```
+## 🔧 Files Modified
+1. **`task_app.py`**:
+   - Added `use_vision` and `image_only_mode` support in `_call_inference`
+   - Fixed OpenAI API URL construction
+   - Added basic trace payload generation
+   - **Still needs**: SessionTracer integration for Turso persistence
+2. **`eval_image_only_gpt4o.toml`** (new):
+   - Config for image-only evaluation
+   - 10 seeds, 10 max turns per episode
+   - GPT-4o mini with vision enabled
+## 🚀 Next Steps to Complete Turso Integration
+### Option 1: Quick Fix (Minimal Tracing)
+Just save basic session info without full event tracing:
+```python
+# At start of rollout_executor
+from synth_ai.tracing_v3 import SessionTracer, StorageConfig, StorageBackend
+tracer = SessionTracer(
+    storage_config=StorageConfig(
+        backend=StorageBackend.TURSO_NATIVE,
+        connection_string=f"file:{os.getenv('SQLD_DB_PATH', 'traces/v3/pokemon_red.db')}"
+    ),
+    auto_save=True
+)
+await tracer.initialize()
+session_id = await tracer.start_session(metadata={...})
+# At end of rollout_executor
+await tracer.record_outcome_reward(
+    total_reward=int(total_reward),
+    achievements_count=len(milestone_events),  # or 0 if none
+    total_steps=len(steps),
+    reward_metadata={
+        "final_map": final_state.get("map_id"),
+        "party_count": final_state.get("party_count", 0),
+        "badges": final_state.get("badges", 0),
+        "milestone_events": milestone_events,
+    }
+)
+await tracer.end_session()
+```
+### Option 2: Full Tracing (Like Crafter)
+Integrate complete event tracing like Crafter's rollout.py:
+- Record messages, timesteps, events for each step
+- More complex but provides rich trace data
+- Would require more significant refactoring
+## 📝 Comparison with Crafter
+| Feature | Crafter | Pokemon Red |
+|---------|---------|-------------|
+| Image-only mode | ✅ Working | ✅ Working |
+| OpenAI API | ✅ Working | ✅ Working |
+| Eval CLI | ✅ Working | ✅ Working |
+| SessionTracer | ✅ Integrated | ❌ Not integrated |
+| Turso database | ✅ Saves traces | ❌ No database created |
+| outcome_rewards | ✅ Persisted | ❌ Not saved |
+| Foreign keys | ✅ Working | ❌ N/A |
+| Non-zero rewards | ✅ 7/10 rollouts | ❌ 0/10 rollouts |
+## ✅ Summary
+**Completed**:
+- ✅ Image-only input mode for Pokemon Red
+- ✅ OpenAI API integration with proper auth
+- ✅ Eval CLI runs 10 rollouts successfully
+- ✅ Basic trace payload returned (for CLI)
+**Not Yet Complete**:
+- ❌ Turso database persistence
+- ❌ outcome_rewards table with foreign keys
+- ❌ SessionTracer integration
+- ❌ Queryable rewards by seed
+**To match Crafter's capabilities**, Pokemon Red needs SessionTracer integration (Option 1 or 2 above).
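
The image-only request described in the status notes above (a base64-encoded PNG frame with no text observation, sent to `https://api.openai.com/v1/chat/completions`) can be sketched as follows. This is an illustration under stated assumptions, not the code in `task_app.py`. The helper names `png_to_data_url`, `build_user_content`, and `chat_completions_url` are hypothetical, and the check that avoids appending the path twice is an added precaution that does not appear in the diff.

```python
import base64

CHAT_PATH = "/v1/chat/completions"


def png_to_data_url(png_bytes: bytes) -> str:
    """Encode raw PNG bytes as a data URL usable in an image_url content part."""
    encoded = base64.b64encode(png_bytes).decode("ascii")
    return f"data:image/png;base64,{encoded}"


def build_user_content(
    image_data_url: str,
    state_summary: str = "",
    image_only_mode: bool = False,
) -> list[dict]:
    """Build the user message content list for a vision request.

    In image-only mode the text observation is dropped entirely; otherwise
    the text summary is sent ahead of the image, as in the documented snippet.
    """
    image_part = {"type": "image_url", "image_url": {"url": image_data_url}}
    if image_only_mode:
        return [image_part]
    return [{"type": "text", "text": state_summary}, image_part]


def chat_completions_url(inference_url: str) -> str:
    """Append the chat completions path for OpenAI hosts, at most once.

    Hypothetical hardening: the suffix check makes the helper idempotent, so
    a URL that already ends with the path is returned unchanged.
    """
    base = inference_url.rstrip("/")
    if "api.openai.com" in base and not base.endswith(CHAT_PATH):
        return base + CHAT_PATH
    return base
```

With `image_only_mode=True` the content list holds a single `image_url` part, which matches the "no text observations" behavior described above. With the flag off, it holds a text part followed by the image, which corresponds to the hybrid mode.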
