PyPI - synth-ai - Versions diffs - 0.2.13.dev1__py3-none-any.whl → 0.2.14__py3-none-any.whl - Mend

synth-ai 0.2.13.dev1__py3-none-any.whl → 0.2.14__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Potentially problematic release.

This version of synth-ai might be problematic.

Files changed (291)

examples/task_apps/IMAGE_ONLY_EVAL_QUICKSTART.md ADDED

@@ -0,0 +1,258 @@
+# Image-Only Evaluation - Quick Reference
+This document provides a quick reference for running image-only evaluations on **Crafter** and **Pokemon Red** with Turso tracing.
+## 📚 Full Documentation
+- **Crafter**: [`crafter/README_IMAGE_ONLY_EVAL.md`](crafter/README_IMAGE_ONLY_EVAL.md)
+- **Pokemon Red**: [`pokemon_red/README_IMAGE_ONLY_EVAL.md`](pokemon_red/README_IMAGE_ONLY_EVAL.md)
+## ⚡ Quick Start
+### Prerequisites
+```bash
+# 1. Set OpenAI API key in .env
+echo "OPENAI_API_KEY=sk-proj-..." >> .env
+# 2. Navigate to synth-ai repo
+cd /path/to/synth-ai
+```
+### Run Crafter (Easier - 70% Success Rate)
+```bash
+# Set up tracing
+export TASKAPP_TRACING_ENABLED=1
+export TURSO_NATIVE=1
+export SQLD_DB_PATH="traces/v3/crafter_eval.db"
+# Run evaluation
+uv run synth-ai eval grpo-crafter \
+  --config examples/task_apps/crafter/eval_image_only_gpt4o.toml
+# Check results
+sqlite3 -header -column traces/v3/crafter_eval.db \
+  "SELECT total_reward, achievements_count,
+   json_extract(reward_metadata, '$.final_achievements') as achievements
+   FROM outcome_rewards WHERE total_reward > 0;"
+```
+### Run Pokemon Red (Harder - 0% with Default Config)
+```bash
+# Set up tracing
+export TASKAPP_TRACING_ENABLED=1
+export TURSO_NATIVE=1
+export SQLD_DB_PATH="traces/v3/pokemon_red_eval.db"
+# Run evaluation
+uv run synth-ai eval pokemon_red \
+  --config examples/task_apps/pokemon_red/eval_image_only_gpt4o.toml
+# Check results
+sqlite3 -header -column traces/v3/pokemon_red_eval.db \
+  "SELECT total_reward, achievements_count,
+   json_extract(reward_metadata, '$.final_map') as map,
+   json_extract(reward_metadata, '$.party_count') as party
+   FROM outcome_rewards;"
+```
+## 📊 Comparison
+| Feature | Crafter | Pokemon Red |
+|---------|---------|-------------|
+| **Difficulty** | Easier | Harder |
+| **Default success** | ~70% earn rewards | ~0% (needs tuning) |
+| **Typical reward** | 1-3 achievements | 0 (10 steps too short) |
+| **Best for** | Testing vision models | RL research |
+| **Recommended steps** | 10 (default works) | 100-500 (need more) |
+## 🔧 Configuration Files
+### Crafter Config
+**Location**: `examples/task_apps/crafter/eval_image_only_gpt4o.toml`
+```toml
+[eval]
+app_id = "grpo-crafter"
+model = "gpt-4o-mini-2024-07-18"
+seeds = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
+max_turns = 10
+env_name = "crafter"
+policy_name = "crafter-react"
+[eval.policy_config]
+use_vision = true
+image_only_mode = true  # Only images, no text
+```
+### Pokemon Red Config
+**Location**: `examples/task_apps/pokemon_red/eval_image_only_gpt4o.toml`
+```toml
+[eval]
+app_id = "pokemon_red"
+model = "gpt-4o-mini-2024-07-18"
+seeds = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
+max_turns = 10
+env_name = "pokemon_red"
+[eval.policy_config]
+use_vision = true
+image_only_mode = true  # Only images, no text
+```
+## 📈 Improving Pokemon Red Results
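+Pokemon Red is harder than Crafter, and the default 10 turns are too few to earn any reward. To get non-zero rewards, raise the step limits, use the full GPT-4o model, and enable text observations: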
+```toml
+[eval]
+model = "gpt-4o-2024-08-06"  # Use full GPT-4o
+max_turns = 100
+[eval.env_config]
+env_params = {max_steps_per_episode = 500}
+[eval.policy_config]
+model = "gpt-4o-2024-08-06"
+image_only_mode = false  # Enable text too (multimodal)
+max_llm_calls = 100
+```
+## 🗄️ Database Queries
+### Get All Rewards
+```sql
+-- Crafter
+SELECT
+    json_extract(reward_metadata, '$.env_seed') as seed,
+    total_reward,
+    achievements_count,
+    json_extract(reward_metadata, '$.final_achievements') as achievements
+FROM outcome_rewards
+ORDER BY total_reward DESC;
+-- Pokemon Red
+SELECT
+    session_id,
+    total_reward,
+    achievements_count,
+    json_extract(reward_metadata, '$.final_map') as map,
+    json_extract(reward_metadata, '$.party_count') as party
+FROM outcome_rewards
+ORDER BY total_reward DESC;
+```
+### Filter Non-Zero Rewards
+```sql
+SELECT * FROM outcome_rewards WHERE total_reward > 0;
+```
+### Get Statistics
+```sql
+SELECT
+    COUNT(*) as total,
+    SUM(CASE WHEN total_reward > 0 THEN 1 ELSE 0 END) as with_rewards,
+    AVG(total_reward) as avg_reward,
+    MAX(total_reward) as max_reward
+FROM outcome_rewards;
+```
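The statistics query above can also be run from Python's standard `sqlite3` module. The sketch below is illustrative only: `make_demo_db` builds an in-memory `outcome_rewards` table containing just the columns this guide's queries reference (the real trace schema likely has more), and the sample rows are made up.

```python
import sqlite3

STATS_SQL = """
SELECT
    COUNT(*) AS total,
    SUM(CASE WHEN total_reward > 0 THEN 1 ELSE 0 END) AS with_rewards,
    AVG(total_reward) AS avg_reward,
    MAX(total_reward) AS max_reward
FROM outcome_rewards;
"""


def reward_stats(conn: sqlite3.Connection) -> dict:
    """Run the statistics query and return its single result row as a dict."""
    conn.row_factory = sqlite3.Row
    row = conn.execute(STATS_SQL).fetchone()
    return dict(row)


def make_demo_db() -> sqlite3.Connection:
    """Build an in-memory stand-in for a trace database (assumed, minimal schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE outcome_rewards ("
        "session_id TEXT, total_reward REAL, "
        "achievements_count INTEGER, reward_metadata TEXT)"
    )
    rows = [
        ("s0", 2.0, 2, '{"env_seed": 0}'),
        ("s1", 0.0, 0, '{"env_seed": 1}'),
        ("s2", 1.0, 1, '{"env_seed": 2}'),
        ("s3", 0.0, 0, '{"env_seed": 3}'),
    ]
    conn.executemany("INSERT INTO outcome_rewards VALUES (?, ?, ?, ?)", rows)
    return conn


if __name__ == "__main__":
    print(reward_stats(make_demo_db()))
```

To point the same helper at a real run, pass `sqlite3.connect("traces/v3/crafter_eval.db")` instead of the demo connection, assuming the file was written to the `SQLD_DB_PATH` used earlier.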
+## 🎯 What is Image-Only Mode?
+**Image-Only Mode** means:
+- ✅ Agent receives **only** base64-encoded PNG images
+- ❌ Agent receives **no** text observations (HP, position, inventory, etc.)
+- 🎓 Tests pure vision understanding
+**Multimodal Mode** (recommended for Pokemon Red):
+- ✅ Agent receives **both** images and text
+- 🏆 Better performance but "easier"
+Toggle with:
+```toml
+[eval.policy_config]
+use_vision = true         # Enable vision
+image_only_mode = false   # false = send text too
+```
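To make the difference between the two modes concrete, here is a minimal sketch of how a single observation might be packed into a chat message. `build_user_content` is a hypothetical helper, not synth-ai's implementation; the part layout follows the OpenAI chat convention of `image_url` and `text` content parts, which is an assumption about how the policy talks to the model.

```python
import base64


def build_user_content(png_bytes: bytes, text_obs: str, image_only_mode: bool) -> list[dict]:
    """Assemble one observation as a list of chat content parts.

    Hypothetical helper for illustration; names and layout are assumptions.
    """
    encoded = base64.b64encode(png_bytes).decode("ascii")
    parts = [
        {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{encoded}"}}
    ]
    if not image_only_mode:
        # Multimodal mode: send the text observation alongside the frame.
        parts.append({"type": "text", "text": text_obs})
    return parts
```

With `image_only_mode=True` the returned list holds only the image part, so details such as HP or inventory never reach the model unless it reads them off the frame.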
+## 📁 Files Created
+### Crafter
+- `crafter/eval_image_only_gpt4o.toml` - Config
+- `crafter/README_IMAGE_ONLY_EVAL.md` - Full guide
+- `crafter/EVAL_IMAGE_ONLY_RESULTS.md` - Example results
+- `crafter/QUERY_EXAMPLES.md` - SQL queries
+### Pokemon Red
+- `pokemon_red/eval_image_only_gpt4o.toml` - Config
+- `pokemon_red/README_IMAGE_ONLY_EVAL.md` - Full guide
+- `pokemon_red/EVAL_IMAGE_ONLY_COMPLETE.md` - Implementation
+- `pokemon_red/EVAL_IMAGE_ONLY_STATUS.md` - Status
+## 🐛 Common Issues
+### Database Not Created
+```bash
+# Ensure variables are set
+export TASKAPP_TRACING_ENABLED=1
+export TURSO_NATIVE=1
+export SQLD_DB_PATH="traces/v3/your_eval.db"
+```
+### 401 Unauthorized
+```bash
+# Check API key in .env
+grep OPENAI_API_KEY .env
+```
+### Pokemon Red: ROM Not Found
+```bash
+# Place ROM at expected location
+cp pokemon_red.gb synth_ai/environments/examples/red/roms/
+```
+### All Rewards Zero
+- **Crafter**: Should get ~70% non-zero by default
+- **Pokemon Red**: Expected with 10 steps - increase to 100-500
+## 🎓 Understanding Results
+### Crafter Achievements
+- `collect_wood` - Cut down trees
+- `collect_sapling` - Collect tree saplings
+- `collect_drink` - Drink from water
+### Pokemon Red Milestones
+- Leave bedroom (+20)
+- Exit house (+30)
+- Find Oak's lab (+40)
+- Get starter Pokemon (+100)
+- Win first battle (+150)
+**Total possible**: ~600 points
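As a quick sanity check on the numbers, the sketch below sums the milestone bonuses listed above. The dictionary keys are made-up identifiers, and additive scoring is an assumption. The five listed bonuses add up to 340, so the ~600 total presumably includes milestones not enumerated in this list.

```python
# Milestone bonuses exactly as listed in the guide; the key names are made up.
MILESTONE_REWARDS = {
    "leave_bedroom": 20,
    "exit_house": 30,
    "find_oaks_lab": 40,
    "get_starter_pokemon": 100,
    "win_first_battle": 150,
}


def milestone_reward(reached: set[str]) -> int:
    """Sum the bonus for every listed milestone in `reached`."""
    unknown = reached - set(MILESTONE_REWARDS)
    if unknown:
        raise KeyError(f"unknown milestones: {sorted(unknown)}")
    return sum(MILESTONE_REWARDS[name] for name in reached)


if __name__ == "__main__":
    # An agent that only leaves the bedroom and exits the house.
    print(milestone_reward({"leave_bedroom", "exit_house"}))
```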
+## 🚀 Next Steps
+1. **Read full docs**: See task-specific READMEs for details
+2. **Run evaluations**: Start with Crafter (easier)
+3. **Query database**: Use SQL to analyze results
+4. **Tune configs**: Adjust steps/model for better performance
+5. **Compare modes**: Try image-only vs multimodal
+## 📞 Support
+For issues or questions:
+1. Check full README for your task app
+2. Review example results files
+3. Query database to verify data
+4. Adjust config parameters
+Happy evaluating! 🎮

examples/task_apps/TESTING.md ADDED

@@ -0,0 +1,275 @@
+# Task App Testing Guide
+This document describes how to run tests for the task apps in this directory.
+## Overview
+Each task app has unit and integration tests following a consistent pattern inspired by the customer environment tests in `customers/`.
+## Test Structure
+```
+examples/task_apps/<app_name>/tests/
+├── __init__.py
+├── integration/
+│   ├── __init__.py
+│   └── test_<app>_eval.py      # Server startup + eval tests
+└── unit/
+    ├── __init__.py
+    └── test_<app>_*.py          # Environment, scoring, dataset tests
+```
+## Running Tests
+### Prerequisites
+```bash
+# Install test dependencies
+uv sync --dev
+# Set required environment variables
+export GROQ_API_KEY="your-groq-key"
+export OPENAI_API_KEY="your-openai-key"  # For Sokoban
+```
+### Run All Tests for a Task App
+```bash
+# Verilog
+pytest examples/task_apps/verilog/tests/ -v
+# Enron
+pytest examples/task_apps/enron/tests/ -v
+# Sokoban
+pytest examples/task_apps/sokoban/tests/ -v
+```
+### Run Only Unit Tests (Fast)
+```bash
+# Runs quickly, no server startup required
+pytest examples/task_apps/verilog/tests/unit/ -v
+pytest examples/task_apps/enron/tests/unit/ -v
+pytest examples/task_apps/sokoban/tests/unit/ -v
+```
+### Run Only Integration Tests
+```bash
+# Slower, starts servers and runs evals
+pytest examples/task_apps/verilog/tests/integration/ -v
+pytest examples/task_apps/enron/tests/integration/ -v
+pytest examples/task_apps/sokoban/tests/integration/ -v
+```
+### Run All Task App Tests
+```bash
+# Run everything
+pytest examples/task_apps/*/tests/ -v
+# Skip slow tests
+pytest examples/task_apps/*/tests/ -v -m "not slow"
+```
+## Test Categories
+### Unit Tests
+**Purpose**: Test individual components in isolation
+- Environment initialization
+- Reward calculation
+- Tool implementations
+- State management
+**Characteristics**:
+- Fast (< 1 second each)
+- No external dependencies
+- No server startup
+- No API calls
+**Examples**:
+- `test_verilog_scoring.py`: Tests reward components (compile, simulate, submit)
+- `test_enron_environment.py`: Tests search, answer, reward calculation
+- `test_sokoban_environment.py`: Tests actions, rewards, truncation
+### Integration Tests
+**Purpose**: Test the full system end-to-end
+- Server startup
+- Health/info endpoints
+- Full evaluation runs
+- **Rollout execution** (manual and policy-driven)
+**Characteristics**:
+- Slower (30-300 seconds)
+- Requires server startup
+- May require API keys
+- Tests real workflows
+**Examples**:
+- `test_verilog_eval.py`: Starts server, runs Groq eval with Qwen3-32B
+- `test_verilog_rollout.py`: **Manual & policy rollouts via /rollout endpoint**
+- `test_enron_eval.py`: Starts server, runs Groq eval
+- `test_enron_rollout.py`: **Manual & policy rollouts, auth testing**
+- `test_sokoban_eval.py`: Starts server, tests manual rollout
+- `test_sokoban_rollout.py`: **6 rollout tests (manual, policy, difficulties, limits)**
+## What Each Test Validates
+### Verilog Tests
+**Unit Tests** (4 tests):
+- ✅ Compile success gives +0.1 reward
+- ✅ Simulation pass gives +1.0 reward
+- ✅ Submit success gives +10.0 reward
+- ✅ Submit checks last simulation output correctly
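The reward components listed above could be modelled roughly as follows. This is a sketch under stated assumptions, not the Verilog environment's code: the event format and `episode_reward` are hypothetical, and the rule that a submit only pays out when the most recent simulation passed is one reading of "checks last simulation output".

```python
COMPILE_REWARD = 0.1
SIMULATE_REWARD = 1.0
SUBMIT_REWARD = 10.0


def episode_reward(events: list[tuple[str, bool]]) -> float:
    """Total reward for a sequence of (tool_name, succeeded) events."""
    total = 0.0
    last_simulation_passed = False
    for tool, ok in events:
        if tool == "compile" and ok:
            total += COMPILE_REWARD
        elif tool == "simulate":
            # Remember the latest simulation outcome for a later submit.
            last_simulation_passed = ok
            if ok:
                total += SIMULATE_REWARD
        elif tool == "submit" and ok and last_simulation_passed:
            total += SUBMIT_REWARD
    return total
```

A unit test in this style would feed short event sequences to the function and assert the expected total, for example that a submit after a failed simulation earns nothing beyond the compile bonus.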
+**Integration Tests** (5 tests):
+- ✅ Server starts and responds to /health
+- ✅ /task_info returns valid Verilog task metadata
+- ✅ Full eval with Qwen3-32B completes successfully
+- ✅ **Manual rollout** with explicit write/compile/simulate/submit
+- ✅ **Policy rollout** using Groq/Qwen3-32B (verifies LLM integration)
+### Enron Tests
+**Unit Tests** (3 tests):
+- ✅ search_emails tool works correctly
+- ✅ answer_question tool calculates rewards
+- ✅ Exact answer match gives high reward (>0.9)
+- ✅ Partial answer match gives medium reward (>0.5)
+- ✅ Wrong answer gives low reward (<0.5)
+**Integration Tests** (6 tests):
+- ✅ Server starts and responds to /health
+- ✅ /task_info returns valid Enron task metadata
+- ✅ Full eval with Qwen3-32B completes successfully
+- ✅ **Manual rollout** with explicit search/read/answer actions
+- ✅ **Policy rollout** using Groq/Qwen3-32B
+- ✅ **Authentication** enforcement (rejects requests without auth header)
+### Sokoban Tests
+**Unit Tests** (3 tests):
+- ✅ Module imports work correctly
+- ✅ Reward components exist (goal achieved, step penalty)
+- ✅ Engine creation with different difficulty levels
+**Integration Tests** (9 tests):
+- ✅ Server starts and responds to /health
+- ✅ /task_info returns valid Sokoban task metadata
+- ✅ **Manual rollout** with movement actions (left/right/up/down)
+- ✅ **Policy rollout** with OpenAI GPT-5-mini (may skip if slow)
+- ✅ **All difficulty levels** (easy/medium/hard) work correctly
+- ✅ **Max steps limit** enforcement (stops at configured limit)
+- ✅ **Puzzle completion detection** (terminated=True when solved)
+- ✅ Truncation on max_steps
+- ✅ Response structure validation
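The distinction between puzzle completion (`terminated`) and hitting the step limit (`truncated`) is easy to blur, so here is a toy illustration. `run_episode` walks a one-dimensional corridor rather than a Sokoban board; the function, its arguments, and the returned keys are all hypothetical stand-ins for the real rollout response.

```python
MOVES = {"left": -1, "right": 1}


def run_episode(actions: list[str], goal: int, max_steps: int) -> dict:
    """Walk a 1-D corridor from position 0; a toy stand-in for a rollout."""
    position = 0
    steps = 0
    for action in actions:
        if steps >= max_steps:
            # Step budget exhausted before solving: truncated, not terminated.
            return {"terminated": False, "truncated": True, "steps": steps}
        position += MOVES[action]
        steps += 1
        if position == goal:
            # Goal reached: the episode ends because the task is solved.
            return {"terminated": True, "truncated": False, "steps": steps}
    return {"terminated": False, "truncated": steps >= max_steps, "steps": steps}
```

In this sketch an episode that reaches the goal on its final allowed step counts as terminated rather than truncated; whether the real environment makes the same choice is not stated in the guide.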
+## Debugging Test Failures
+### Server Won't Start
+```bash
+# Check if port is already in use
+lsof -i :<port>
+# Check logs manually
+uv run -m synth_ai task-app serve <app_name> --port 8999
+# Check environment variables
+echo $GROQ_API_KEY
+echo $OPENAI_API_KEY
+```
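Where `lsof` is unavailable, the port check can be approximated from Python with the standard `socket` module. `port_in_use` is a hypothetical helper; it only detects listeners reachable on the given host, and port 8999 is simply the value used in the command above.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something is already accepting connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        # connect_ex returns 0 on success instead of raising an exception.
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    print(port_in_use(8999))
```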
+### Tests Timeout
+```bash
+# Run with more verbose output
+pytest <test_file> -v -s
+# Fail any test that runs longer than 60 seconds (requires the pytest-timeout plugin)
+pytest <test_file> -v --timeout=60
+```
+### Import Errors
+```bash
+# Ensure you're in the right directory
+cd /path/to/synth-ai
+# Reinstall dependencies
+uv sync --dev
+```
+## CI/CD Integration
+These tests can be run in CI with:
+```yaml
+# .github/workflows/test-task-apps.yml
+- name: Run task app tests
+  env:
+    GROQ_API_KEY: ${{ secrets.GROQ_API_KEY }}
+    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
+  run: |
+    # Unit tests (fast, always run)
+    pytest examples/task_apps/*/tests/unit/ -v
+    # Integration tests (slower, only on main)
+    if [ "$GITHUB_REF" = "refs/heads/main" ]; then
+      pytest examples/task_apps/*/tests/integration/ -v --timeout=300
+    fi
+```
+## Adding Tests for New Task Apps
+When creating a new task app, follow this pattern:
+1. **Create test structure**:
+   ```bash
+   mkdir -p examples/task_apps/<new_app>/tests/{unit,integration}
+   touch examples/task_apps/<new_app>/tests/__init__.py
+   touch examples/task_apps/<new_app>/tests/unit/__init__.py
+   touch examples/task_apps/<new_app>/tests/integration/__init__.py
+   ```
+2. **Create unit tests** (`tests/unit/test_<app>_*.py`):
+   - Test environment initialization
+   - Test reward calculation
+   - Test tool implementations
+   - Test edge cases
+3. **Create integration tests** (`tests/integration/test_<app>_eval.py`):
+   - Copy from an existing integration test
+   - Update app name, port, config path
+   - Add app-specific endpoint tests
+4. **Add to CI**:
+   - Update CI config to include new tests
+   - Ensure required env vars are set
+## Test Coverage Goals
+- Unit test coverage: >80%
+- Integration test coverage: 100% of critical paths
+- All public APIs have at least one integration test
+- All reward components have unit tests
+## Common Issues
+### "Task app terminated immediately"
+- Check that the app name is correct
+- Verify the app is registered in `synth_ai/task/apps.py`
+- Check recent changes to the app code
+### "GROQ_API_KEY must be set"
+- Set the environment variable
+- Or skip the test: `pytest -k "not groq"`
+### "Config file not found"
+- Ensure eval config exists in task app directory
+- Check the path in the test matches actual location

examples/task_apps/__init__.py ADDED

File without changes
