PyPI - synth-ai - Versions diffs - 0.2.12__py3-none-any.whl → 0.2.13.dev2__py3-none-any.whl - Mend

synth-ai 0.2.12__py3-none-any.whl → 0.2.13.dev2__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Potentially problematic release.

This version of synth-ai might be problematic.

Files changed (229)

examples/task_apps/TESTING.md ADDED

@@ -0,0 +1,275 @@
+# Task App Testing Guide
+This document describes how to run tests for the task apps in this directory.
+## Overview
+Each task app has unit and integration tests following a consistent pattern inspired by the customer environment tests in `customers/`.
+## Test Structure
+```
+examples/task_apps/<app_name>/tests/
+├── __init__.py
+├── integration/
+│   ├── __init__.py
+│   └── test_<app>_eval.py      # Server startup + eval tests
+└── unit/
+    ├── __init__.py
+    └── test_<app>_*.py          # Environment, scoring, dataset tests
+```
+## Running Tests
+### Prerequisites
+```bash
+# Install test dependencies
+uv sync --dev
+# Set required environment variables
+export GROQ_API_KEY="your-groq-key"
+export OPENAI_API_KEY="your-openai-key"  # For Sokoban
+```
+### Run All Tests for a Task App
+```bash
+# Verilog
+pytest examples/task_apps/verilog/tests/ -v
+# Enron
+pytest examples/task_apps/enron/tests/ -v
+# Sokoban
+pytest examples/task_apps/sokoban/tests/ -v
+```
+### Run Only Unit Tests (Fast)
+```bash
+# Runs quickly, no server startup required
+pytest examples/task_apps/verilog/tests/unit/ -v
+pytest examples/task_apps/enron/tests/unit/ -v
+pytest examples/task_apps/sokoban/tests/unit/ -v
+```
+### Run Only Integration Tests
+```bash
+# Slower, starts servers and runs evals
+pytest examples/task_apps/verilog/tests/integration/ -v
+pytest examples/task_apps/enron/tests/integration/ -v
+pytest examples/task_apps/sokoban/tests/integration/ -v
+```
+### Run All Task App Tests
+```bash
+# Run everything
+pytest examples/task_apps/*/tests/ -v
+# Skip slow tests
+pytest examples/task_apps/*/tests/ -v -m "not slow"
+```
+## Test Categories
+### Unit Tests
+**Purpose**: Test individual components in isolation
+- Environment initialization
+- Reward calculation
+- Tool implementations
+- State management
+**Characteristics**:
+- Fast (< 1 second each)
+- No external dependencies
+- No server startup
+- No API calls
+**Examples**:
+- `test_verilog_scoring.py`: Tests reward components (compile, simulate, submit)
+- `test_enron_environment.py`: Tests search, answer, reward calculation
+- `test_sokoban_environment.py`: Tests actions, rewards, truncation
+### Integration Tests
+**Purpose**: Test the full system end-to-end
+- Server startup
+- Health/info endpoints
+- Full evaluation runs
+- **Rollout execution** (manual and policy-driven)
+**Characteristics**:
+- Slower (30-300 seconds)
+- Requires server startup
+- May require API keys
+- Tests real workflows
+**Examples**:
+- `test_verilog_eval.py`: Starts server, runs Groq eval with Qwen3-32B
+- `test_verilog_rollout.py`: **Manual & policy rollouts via /rollout endpoint**
+- `test_enron_eval.py`: Starts server, runs Groq eval
+- `test_enron_rollout.py`: **Manual & policy rollouts, auth testing**
+- `test_sokoban_eval.py`: Starts server, tests manual rollout
+- `test_sokoban_rollout.py`: **6 rollout tests (manual, policy, difficulties, limits)**
+## What Each Test Validates
+### Verilog Tests
+**Unit Tests** (4 tests):
+- ✅ Compile success gives +0.1 reward
+- ✅ Simulation pass gives +1.0 reward
+- ✅ Submit success gives +10.0 reward
+- ✅ Submit checks last simulation output correctly
+**Integration Tests** (5 tests):
+- ✅ Server starts and responds to /health
+- ✅ /task_info returns valid Verilog task metadata
+- ✅ Full eval with Qwen3-32B completes successfully
+- ✅ **Manual rollout** with explicit write/compile/simulate/submit
+- ✅ **Policy rollout** using Groq/Qwen3-32B (verifies LLM integration)
+### Enron Tests
+**Unit Tests** (3 tests):
+- ✅ search_emails tool works correctly
+- ✅ answer_question tool calculates rewards
+- ✅ Exact answer match gives high reward (>0.9)
+- ✅ Partial answer match gives medium reward (>0.5)
+- ✅ Wrong answer gives low reward (<0.5)
+**Integration Tests** (6 tests):
+- ✅ Server starts and responds to /health
+- ✅ /task_info returns valid Enron task metadata
+- ✅ Full eval with Qwen3-32B completes successfully
+- ✅ **Manual rollout** with explicit search/read/answer actions
+- ✅ **Policy rollout** using Groq/Qwen3-32B
+- ✅ **Authentication** enforcement (rejects requests without auth header)
+### Sokoban Tests
+**Unit Tests** (3 tests):
+- ✅ Module imports work correctly
+- ✅ Reward components exist (goal achieved, step penalty)
+- ✅ Engine creation with different difficulty levels
+**Integration Tests** (9 tests):
+- ✅ Server starts and responds to /health
+- ✅ /task_info returns valid Sokoban task metadata
+- ✅ **Manual rollout** with movement actions (left/right/up/down)
+- ✅ **Policy rollout** with OpenAI GPT-5-mini (may skip if slow)
+- ✅ **All difficulty levels** (easy/medium/hard) work correctly
+- ✅ **Max steps limit** enforcement (stops at configured limit)
+- ✅ **Puzzle completion detection** (terminated=True when solved)
+- ✅ Truncation on max_steps
+- ✅ Response structure validation
+## Debugging Test Failures
+### Server Won't Start
+```bash
+# Check if port is already in use
+lsof -i :<port>
+# Check logs manually
+uv run -m synth_ai task-app serve <app_name> --port 8999
+# Check environment variables
+echo $GROQ_API_KEY
+echo $OPENAI_API_KEY
+```
+### Tests Timeout
+```bash
+# Run with more verbose output
+pytest <test_file> -v -s
+# Fail any test that runs longer than 60 seconds (requires the pytest-timeout plugin)
+pytest <test_file> -v --timeout=60
+```
+### Import Errors
+```bash
+# Ensure you're in the right directory
+cd /path/to/synth-ai
+# Reinstall dependencies
+uv sync --dev
+```
+## CI/CD Integration
+These tests can be run in CI with:
+```yaml
+# .github/workflows/test-task-apps.yml
+- name: Run task app tests
+  env:
+    GROQ_API_KEY: ${{ secrets.GROQ_API_KEY }}
+    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
+  run: |
+    # Unit tests (fast, always run)
+    pytest examples/task_apps/*/tests/unit/ -v
+    # Integration tests (slower, only on main)
+    if [ "$GITHUB_REF" = "refs/heads/main" ]; then
+      pytest examples/task_apps/*/tests/integration/ -v --timeout=300
+    fi
+```
+## Adding Tests for New Task Apps
+When creating a new task app, follow this pattern:
+1. **Create test structure**:
+   ```bash
+   mkdir -p examples/task_apps/<new_app>/tests/{unit,integration}
+   touch examples/task_apps/<new_app>/tests/__init__.py
+   touch examples/task_apps/<new_app>/tests/unit/__init__.py
+   touch examples/task_apps/<new_app>/tests/integration/__init__.py
+   ```
+2. **Create unit tests** (`tests/unit/test_<app>_*.py`):
+   - Test environment initialization
+   - Test reward calculation
+   - Test tool implementations
+   - Test edge cases
+3. **Create integration tests** (`tests/integration/test_<app>_eval.py`):
+   - Copy from an existing integration test
+   - Update app name, port, config path
+   - Add app-specific endpoint tests
+4. **Add to CI**:
+   - Update CI config to include new tests
+   - Ensure required env vars are set
+## Test Coverage Goals
+- Unit test coverage: >80%
+- Integration test coverage: 100% of critical paths
+- All public APIs have at least one integration test
+- All reward components have unit tests
+## Common Issues
+### "Task app terminated immediately"
+- Check that the app name is correct
+- Verify the app is registered in `synth_ai/task/apps.py`
+- Check recent changes to the app code
+### "GROQ_API_KEY must be set"
+- Set the environment variable
+- Or skip the test: `pytest -k "not groq"`
+### "Config file not found"
+- Ensure eval config exists in task app directory
+- Check the path in the test matches actual location
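
As an illustration of the unit-test pattern the guide describes for reward components, here is a minimal, self-contained sketch. The reward values (+0.1 for compile, +1.0 for a simulation pass, +10.0 for submit) are taken from the guide's Verilog section; the `REWARDS` table, the event names, and the `total_reward` helper are hypothetical and do not mirror the task app's actual modules.

```python
import math

# Reward values quoted in the guide's Verilog section. The dictionary and the
# event names are hypothetical, chosen only for this illustration.
REWARDS = {"compile_ok": 0.1, "simulate_pass": 1.0, "submit_ok": 10.0}


def total_reward(events: list[str]) -> float:
    """Sum the reward for each recognized event; unknown events score zero."""
    return sum(REWARDS.get(event, 0.0) for event in events)


def test_compile_only() -> None:
    assert math.isclose(total_reward(["compile_ok"]), 0.1)


def test_full_episode() -> None:
    events = ["compile_ok", "simulate_pass", "submit_ok"]
    assert math.isclose(total_reward(events), 11.1)


def test_unknown_event_scores_zero() -> None:
    assert total_reward(["lint_warning"]) == 0.0
```

A file shaped like this needs no server startup or API keys, which is what lets the unit suite stay fast.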

examples/task_apps/__init__.py ADDED

File without changes

examples/task_apps/crafter/__init__.py ADDED

File without changes

examples/task_apps/crafter/task_app/__init__.py ADDED

@@ -0,0 +1,2 @@
+"""Crafter task app implementation."""
+

examples/{warming_up_to_rl → task_apps/crafter}/task_app/grpo_crafter.py RENAMED

@@ -68,7 +68,7 @@ def _resolve_repo_root() -> Path:
 def _resolve_task_app_root(repo_root: Path) -> Path:
     """Locate the task_app directory even when the module is copied to a temp mount."""
-    preferred = (repo_root / "examples" / "warming_up_to_rl" / "task_app").resolve()
+    preferred = (repo_root / "examples" / "task_apps" / "crafter" / "task_app").resolve()
     if preferred.is_dir():
         return preferred
@@ -81,7 +81,7 @@ def _resolve_task_app_root(repo_root: Path) -> Path:
         if (candidate / "synth_envs_hosted").is_dir():
             return candidate
-    fallback = Path("/opt/synth_ai_repo/examples/warming_up_to_rl/task_app")
+    fallback = Path("/opt/synth_ai_repo/examples/task_apps/crafter/task_app")
     if fallback.is_dir():
         return fallback.resolve()
@@ -93,6 +93,7 @@ TASK_APP_ROOT = _resolve_task_app_root(REPO_ROOT)
 SYNTH_ENVS_HOSTED_ROOT = (TASK_APP_ROOT / "synth_envs_hosted").resolve()
 EXAMPLES_ROOT = (REPO_ROOT / "examples").resolve()
+RUBRICS_ROOT = (EXAMPLES_ROOT / "multi_step" / "rubrics").resolve()
 for path in (REPO_ROOT, TASK_APP_ROOT, SYNTH_ENVS_HOSTED_ROOT, EXAMPLES_ROOT):
     try:
@@ -305,13 +306,16 @@ def build_dataset() -> tuple[TaskDatasetRegistry, CrafterDataset]:
 def _base_task_info(dataset: CrafterDataset) -> TaskInfo:
     return TaskInfo(
         task={"id": "crafter_classic", "name": "Crafter Classic", "version": "1.0.0"},
-        environments=["crafter"],
+        environment="crafter",
         action_space={
             "type": "discrete",
+            "description": f"Discrete action space with {len(crafter_constants.actions)} actions including movement, crafting, and interaction",
             "size": len(crafter_constants.actions),
             "actions": list(crafter_constants.actions),
         },
         observation={
+            "type": "dict",
+            "description": "RGB frame (64x64x3) plus inventory counts, achievements, and semantic map patches",
             "summary": "RGB frame plus inventory, achievements, and semantic map patches.",
             "keys": ["image", "inventory", "achievements", "semantic_map_patch7"],
             "image_shape": [64, 64, 3],
@@ -335,49 +339,13 @@ def _base_task_info(dataset: CrafterDataset) -> TaskInfo:
             },
             "tool": {"name": "interact", "parallel_tool_calls": False},
         },
-        capabilities={
-            "supports_rollout": True,
-            "supports_env_lifecycle": True,
-            "requires_api_key_header": True,
-        },
         limits={"max_ops": 100000, "max_time_s": 3600},
     )
-OUTCOME_RUBRIC = load_rubric(
-    {
-        "version": "1",
-        "goal_text": "Reward unlocking Crafter achievements and survival.",
-        "aggregation": "weighted_sum",
-        "criteria": [
-            {
-                "id": "achievements",
-                "description": "Unlock achievements or crafting milestones.",
-                "weight": 1.0,
-            },
-            {
-                "id": "survival",
-                "description": "Maintain health, food, and drink levels.",
-                "weight": 1.0,
-            },
-        ],
-    }
-)
+OUTCOME_RUBRIC = load_rubric(str(RUBRICS_ROOT / "crafter_outcome_rubric.json"))
-EVENTS_RUBRIC = load_rubric(
-    {
-        "version": "1",
-        "goal_text": "Encourage purposeful step-wise exploration and crafting.",
-        "aggregation": "weighted_sum",
-        "criteria": [
-            {
-                "id": "progress_steps",
-                "description": "Actions progress quests, crafting, or exploration.",
-                "weight": 1.0,
-            }
-        ],
-    }
-)
+EVENTS_RUBRIC = load_rubric(str(RUBRICS_ROOT / "crafter_events_rubric.json"))
 def describe_taskset(dataset: CrafterDataset) -> dict[str, Any]:
@@ -396,29 +364,36 @@ def provide_task_instances(
     dataset: CrafterDataset, base_info: TaskInfo, seeds: Sequence[int]
 ) -> Iterable[TaskInfo]:
     infos: list[TaskInfo] = []
+    base_observation = getattr(base_info, "observation", None)
+    if hasattr(base_observation, "model_dump"):
+        observation_template = base_observation.model_dump()
+    elif isinstance(base_observation, dict):
+        observation_template = dict(base_observation)
+    else:
+        observation_template = {}
     for seed_value in seeds:
         summary = dataset.describe_seed(seed_value)
         infos.append(
             TaskInfo(
                 task=base_info.task,
-                environments=base_info.environments,
+                environment=base_info.environment,
                 action_space=base_info.action_space,
                 observation={
-                    **base_info.observation,
+                    **observation_template,
                     "seed": seed_value,
                     "traits": summary["traits"],
                     "inventory": summary["inventory"],
                     "player_position": summary["player_position"],
                 },
                 dataset={
-                    **base_info.dataset,
+                    **base_info.dataset.model_dump(),
                     "seed": seed_value,
                     "difficulty": summary["difficulty"],
                     "config": summary["config"],
                 },
                 rubric=base_info.rubric,
                 inference=base_info.inference,
-                capabilities=base_info.capabilities,
                 limits=base_info.limits,
             )
         )
@@ -689,7 +664,7 @@ register_task_app(
                 # Mount repo root so local modules resolve when deployed on Modal
                 (str(REPO_ROOT), "/opt/synth_ai_repo"),
                 (str(REPO_ROOT / "synth_ai"), "/opt/synth_ai_repo/synth_ai"),
-                (str(TASK_APP_ROOT), "/opt/synth_ai_repo/examples/warming_up_to_rl/task_app"),
+                (str(TASK_APP_ROOT), "/opt/synth_ai_repo/examples/task_apps/crafter/task_app"),
             ),
             secret_names=("groq-api-key", "openai-api-key"),
             memory=16384,
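
The `provide_task_instances` hunk above normalizes `base_info.observation` to a plain dict before unpacking it, presumably because the field may now be a model object rather than a dict. A minimal sketch of that fallback chain follows; `FakeObservation`, `to_template`, and `with_seed` are hypothetical names used only for illustration, and the stand-in class merely imitates an object exposing `model_dump()`.

```python
from typing import Any


class FakeObservation:
    """Hypothetical stand-in for a model object exposing model_dump()."""

    def __init__(self, **fields: Any) -> None:
        self._fields = fields

    def model_dump(self) -> dict[str, Any]:
        return dict(self._fields)


def to_template(base_observation: Any) -> dict[str, Any]:
    """Return a plain-dict copy of the observation, or {} if unsupported."""
    if hasattr(base_observation, "model_dump"):
        return base_observation.model_dump()
    if isinstance(base_observation, dict):
        return dict(base_observation)
    return {}


def with_seed(base_observation: Any, seed: int) -> dict[str, Any]:
    """Merge per-seed fields over the normalized template."""
    return {**to_template(base_observation), "seed": seed}
```

Note that the diff applies this defensive fallback only to `observation`; `base_info.dataset.model_dump()` is called directly, so that field is expected to always be a model object.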

examples/{warming_up_to_rl → task_apps/crafter}/task_app/grpo_crafter_task_app.py RENAMED

@@ -1,7 +1,7 @@
 """Compatibility wrapper for the GRPO Crafter task app.
 This module now delegates to the TaskAppConfig defined in the colocated example at
-`examples/warming_up_to_rl/task_app/grpo_crafter.py`. It is kept for legacy usage
+`examples/task_apps/crafter/task_app/grpo_crafter.py`. It is kept for legacy usage
 (running the file directly or targeting `fastapi_app` from external tooling). Prefer using
 `uvx synth-ai serve grpo-crafter` for local development and testing.
 """

examples/{warming_up_to_rl → task_apps/crafter}/task_app/synth_envs_hosted/envs/crafter/policy.py RENAMED

@@ -44,6 +44,7 @@ class CrafterPolicy(Policy):
         self.inference_url = inference_url
         self.model = model
         self.use_tools = True
+        self.use_vision = False  # Enable vision for VLMs
         # Sampling parameters (populated via initialize(config))
         self.temperature: float | None = None
         self.top_p: float | None = None
@@ -63,6 +64,11 @@ class CrafterPolicy(Policy):
             self.model = config["model"]
         if "use_tools" in config:
             self.use_tools = bool(config["use_tools"])
+        if "use_vision" in config:
+            self.use_vision = bool(config["use_vision"])
+        # Auto-detect vision capability from model name if not explicitly set
+        if "use_vision" not in config and self.model:
+            self.use_vision = self._is_vision_model(self.model)
         # Adopt sampling params from policy config (trainer passes these through)
         if "temperature" in config:
             self.temperature = float(config["temperature"])  # fail fast on bad types
@@ -384,6 +390,7 @@ class CrafterPolicy(Policy):
                 "inference_url": self.inference_url,
                 "model": self.model,
                 "use_tools": self.use_tools,
+                "use_vision": self.use_vision,
             },
             "state": self.state_dict(),
         }
@@ -396,7 +403,8 @@ class CrafterPolicy(Policy):
             inference_url=config["inference_url"],
             model=config.get("model"),
         )
-        policy.use_tools = bool(config["use_tools"])
+        policy.use_tools = bool(config.get("use_tools", True))
+        policy.use_vision = bool(config.get("use_vision", False))
         policy.load_state_dict(state)
         return policy
@@ -446,12 +454,60 @@ class CrafterPolicy(Policy):
         return format_observation(obs_data, step_count=step_idx, max_steps=max_steps)
+    @staticmethod
+    def _is_vision_model(model_name: str) -> bool:
+        """Check if a model supports vision/image inputs based on its name."""
+        if not model_name:
+            return False
+        model_lower = model_name.lower()
+        # Known vision-capable model patterns
+        vision_patterns = [
+            "gpt-4o",           # GPT-4o series
+            "gpt-4-turbo",      # GPT-4 Turbo with vision
+            "gpt-4-vision",     # Explicit vision variant
+            "gpt-5",            # GPT-5 series (all variants support vision)
+            "claude-3",         # All Claude 3 models support vision
+            "gemini",           # Gemini models
+            "qwen-vl",          # Qwen Vision-Language models
+            "qwen2-vl",         # Qwen2 VL
+            "pixtral",          # Mistral's vision model
+            "llava",            # LLaVA models
+            "phi-3-vision",     # Microsoft Phi-3 Vision
+            "internvl",         # InternVL models
+            "cogvlm",           # CogVLM models
+            "vision",           # Generic vision indicator
+        ]
+        return any(pattern in model_lower for pattern in vision_patterns)
     def _extract_image_parts(
         self, observation: dict[str, Any] | None
     ) -> list[dict[str, Any]]:
-        """Crafter policy uses text-only prompts; do not attach image parts."""
-        return []
+        """Extract image parts from crafter observation for vision-capable models.
+        Returns OpenAI-style image_url format if vision is enabled and image data is available.
+        """
+        # Only extract images if vision is enabled for this policy
+        if not self.use_vision:
+            return []
+        if not observation:
+            return []
+        # Get the observation data (could be nested)
+        obs = observation.get("observation", observation)
+        if not isinstance(obs, dict):
+            return []
+        # Extract the data URL (includes base64-encoded image)
+        data_url = obs.get("observation_image_data_url")
+        if not data_url or not isinstance(data_url, str):
+            return []
+        # Return OpenAI-style image_url format
+        return [{"type": "image_url", "image_url": {"url": data_url}}]
     def parse_model_response(
         self, response: str, observation: dict[str, Any]
