openadapt-ml 0.1.0__tar.gz → 0.2.0__tar.gz

This diff shows the changes between publicly available package versions as published to a supported registry. It is provided for informational purposes only.
Files changed (233)
  1. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/.gitignore +1 -0
  2. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/CLAUDE.md +379 -16
  3. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/PKG-INFO +102 -60
  4. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/README.md +96 -59
  5. openadapt_ml-0.2.0/RETRIEVAL_QUICKSTART.md +238 -0
  6. openadapt_ml-0.2.0/docs/GEMINI_GROUNDING_QUICKSTART.md +274 -0
  7. openadapt_ml-0.2.0/docs/IMPLEMENTATION_SUMMARY_GEMINI_GROUNDING.md +322 -0
  8. openadapt_ml-0.2.0/docs/PRIORITY_2_COMPLETION_SUMMARY.md +330 -0
  9. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/docs/azure_waa_setup.md +48 -4
  10. openadapt_ml-0.2.0/docs/background_task_visibility.md +744 -0
  11. openadapt_ml-0.2.0/docs/benchmark_run_ui_design.md +361 -0
  12. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/docs/benchmark_viewer_integration.md +60 -15
  13. openadapt_ml-0.2.0/docs/benchmark_viewer_ux_improvements.md +330 -0
  14. openadapt_ml-0.2.0/docs/capture_format_decision.md +229 -0
  15. openadapt_ml-0.2.0/docs/chrome_extension_design.md +1202 -0
  16. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/docs/cloud_gpu_training.md +60 -1
  17. openadapt_ml-0.2.0/docs/current_state_dec2024.md +223 -0
  18. openadapt_ml-0.2.0/docs/demo_prompt_experiment.md +214 -0
  19. openadapt_ml-0.2.0/docs/demo_retrieval_design.md +666 -0
  20. openadapt_ml-0.2.0/docs/enterprise_integration.md +459 -0
  21. openadapt_ml-0.2.0/docs/experiments/demo_conditioned_prompting_results.md +340 -0
  22. openadapt_ml-0.2.0/docs/experiments/multi_step_experiment_design.md +354 -0
  23. openadapt_ml-0.2.0/docs/experiments/waa_demo_experiment_design.md +430 -0
  24. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/docs/gemini_grounding.md +2 -4
  25. openadapt_ml-0.2.0/docs/images/benchmark_viewer.png +0 -0
  26. openadapt_ml-0.2.0/docs/infra_refactor_design.md +986 -0
  27. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/docs/live_inference_design.md +1 -1
  28. openadapt_ml-0.2.0/docs/mock_adapter_evaluation_fix.md +379 -0
  29. openadapt_ml-0.2.0/docs/openadapt_capture_migration_detailed.md +1434 -0
  30. openadapt_ml-0.2.0/docs/openadapt_capture_migration_plan.md +148 -0
  31. openadapt_ml-0.2.0/docs/parallelization_implementation.md +1141 -0
  32. openadapt_ml-0.2.0/docs/parquet_export_design.md +299 -0
  33. openadapt_ml-0.2.0/docs/research_thesis.md +69 -0
  34. openadapt_ml-0.2.0/docs/schema/README.md +206 -0
  35. openadapt_ml-0.2.0/docs/schema/episode.schema.json +904 -0
  36. openadapt_ml-0.2.0/docs/schema_consolidation_plan.md +282 -0
  37. openadapt_ml-0.2.0/docs/semantic_element_capture.md +618 -0
  38. openadapt_ml-0.2.0/docs/smart_mock_agent_design.md +349 -0
  39. openadapt_ml-0.2.0/docs/sse_architecture.md +432 -0
  40. openadapt_ml-0.2.0/docs/sse_benchmark_endpoint.md +252 -0
  41. openadapt_ml-0.2.0/docs/sse_frontend_integration.md +888 -0
  42. openadapt_ml-0.2.0/docs/sse_quick_reference.md +257 -0
  43. openadapt_ml-0.2.0/docs/sse_usage_examples.md +487 -0
  44. openadapt_ml-0.2.0/docs/waa_demo_recording_guide.md +209 -0
  45. openadapt_ml-0.2.0/docs/waa_live_adapter_design.md +660 -0
  46. openadapt_ml-0.2.0/docs/waa_network_architecture.md +157 -0
  47. openadapt_ml-0.2.0/docs/waa_parallelization_plan.md +281 -0
  48. openadapt_ml-0.2.0/docs/waa_setup.md +315 -0
  49. openadapt_ml-0.2.0/examples/demo_retrieval_example.py +234 -0
  50. openadapt_ml-0.2.0/examples/retrieval_with_capture.py +132 -0
  51. openadapt_ml-0.2.0/examples/train_from_json.py +294 -0
  52. openadapt_ml-0.2.0/negative_control_results/NEGATIVE_CONTROL_REPORT.md +215 -0
  53. openadapt_ml-0.2.0/negative_control_results/RESULTS_SUMMARY.txt +137 -0
  54. openadapt_ml-0.2.0/negative_control_results/negative_control_20251231_005135.json +64 -0
  55. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/openadapt_ml/benchmarks/__init__.py +8 -0
  56. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/openadapt_ml/benchmarks/agent.py +90 -11
  57. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/openadapt_ml/benchmarks/azure.py +35 -6
  58. openadapt_ml-0.2.0/openadapt_ml/benchmarks/cli.py +5132 -0
  59. openadapt_ml-0.2.0/openadapt_ml/benchmarks/live_tracker.py +180 -0
  60. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/openadapt_ml/benchmarks/runner.py +41 -4
  61. openadapt_ml-0.2.0/openadapt_ml/benchmarks/viewer.py +1219 -0
  62. openadapt_ml-0.2.0/openadapt_ml/benchmarks/vm_monitor.py +610 -0
  63. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/openadapt_ml/benchmarks/waa.py +61 -4
  64. openadapt_ml-0.2.0/openadapt_ml/benchmarks/waa_deploy/Dockerfile +222 -0
  65. openadapt_ml-0.2.0/openadapt_ml/benchmarks/waa_deploy/__init__.py +10 -0
  66. openadapt_ml-0.2.0/openadapt_ml/benchmarks/waa_deploy/api_agent.py +539 -0
  67. openadapt_ml-0.2.0/openadapt_ml/benchmarks/waa_deploy/start_waa_server.bat +53 -0
  68. openadapt_ml-0.2.0/openadapt_ml/benchmarks/waa_live.py +619 -0
  69. openadapt_ml-0.2.0/openadapt_ml/cloud/local.py +2344 -0
  70. openadapt_ml-0.2.0/openadapt_ml/cloud/ssh_tunnel.py +553 -0
  71. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/openadapt_ml/datasets/next_action.py +87 -68
  72. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/openadapt_ml/evals/grounding.py +26 -8
  73. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/openadapt_ml/evals/trajectory_matching.py +84 -36
  74. openadapt_ml-0.2.0/openadapt_ml/experiments/demo_prompt/__init__.py +19 -0
  75. openadapt_ml-0.2.0/openadapt_ml/experiments/demo_prompt/format_demo.py +226 -0
  76. openadapt_ml-0.2.0/openadapt_ml/experiments/demo_prompt/results/experiment_20251231_002125.json +83 -0
  77. openadapt_ml-0.2.0/openadapt_ml/experiments/demo_prompt/results/experiment_n30_20251231_165958.json +1100 -0
  78. openadapt_ml-0.2.0/openadapt_ml/experiments/demo_prompt/results/multistep_20251231_025051.json +182 -0
  79. openadapt_ml-0.2.0/openadapt_ml/experiments/demo_prompt/run_experiment.py +531 -0
  80. openadapt_ml-0.2.0/openadapt_ml/experiments/waa_demo/__init__.py +10 -0
  81. openadapt_ml-0.2.0/openadapt_ml/experiments/waa_demo/demos.py +357 -0
  82. openadapt_ml-0.2.0/openadapt_ml/experiments/waa_demo/runner.py +717 -0
  83. openadapt_ml-0.2.0/openadapt_ml/experiments/waa_demo/tasks.py +151 -0
  84. openadapt_ml-0.2.0/openadapt_ml/export/__init__.py +9 -0
  85. openadapt_ml-0.2.0/openadapt_ml/export/__main__.py +6 -0
  86. openadapt_ml-0.2.0/openadapt_ml/export/cli.py +89 -0
  87. openadapt_ml-0.2.0/openadapt_ml/export/parquet.py +265 -0
  88. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/openadapt_ml/ingest/__init__.py +3 -4
  89. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/openadapt_ml/ingest/capture.py +89 -81
  90. openadapt_ml-0.2.0/openadapt_ml/ingest/loader.py +280 -0
  91. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/openadapt_ml/ingest/synthetic.py +221 -159
  92. openadapt_ml-0.2.0/openadapt_ml/retrieval/README.md +226 -0
  93. openadapt_ml-0.2.0/openadapt_ml/retrieval/USAGE.md +391 -0
  94. openadapt_ml-0.2.0/openadapt_ml/retrieval/__init__.py +91 -0
  95. openadapt_ml-0.2.0/openadapt_ml/retrieval/demo_retriever.py +817 -0
  96. openadapt_ml-0.2.0/openadapt_ml/retrieval/embeddings.py +629 -0
  97. openadapt_ml-0.2.0/openadapt_ml/retrieval/index.py +194 -0
  98. openadapt_ml-0.2.0/openadapt_ml/retrieval/retriever.py +160 -0
  99. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/openadapt_ml/runtime/policy.py +10 -10
  100. openadapt_ml-0.2.0/openadapt_ml/schema/__init__.py +104 -0
  101. openadapt_ml-0.2.0/openadapt_ml/schema/converters.py +541 -0
  102. openadapt_ml-0.2.0/openadapt_ml/schema/episode.py +457 -0
  103. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/openadapt_ml/scripts/compare.py +26 -16
  104. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/openadapt_ml/scripts/eval_policy.py +4 -5
  105. openadapt_ml-0.2.0/openadapt_ml/scripts/prepare_synthetic.py +40 -0
  106. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/openadapt_ml/scripts/train.py +81 -70
  107. openadapt_ml-0.2.0/openadapt_ml/training/benchmark_viewer.py +4763 -0
  108. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/openadapt_ml/training/trainer.py +120 -363
  109. openadapt_ml-0.2.0/openadapt_ml/training/trl_trainer.py +354 -0
  110. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/pyproject.toml +11 -1
  111. openadapt_ml-0.2.0/scripts/p0_validate_demo_persistence.py +358 -0
  112. openadapt_ml-0.2.0/scripts/run_demo_experiment.py +240 -0
  113. openadapt_ml-0.2.0/scripts/run_demo_experiment_n30.py +371 -0
  114. openadapt_ml-0.2.0/scripts/run_multistep_experiment.py +402 -0
  115. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/tests/test_action_parsing.py +17 -20
  116. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/tests/test_api_adapter.py +2 -2
  117. openadapt_ml-0.2.0/tests/test_demo_retrieval.py +743 -0
  118. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/tests/test_local_cli.py +35 -74
  119. openadapt_ml-0.2.0/tests/test_parquet_export.py +174 -0
  120. openadapt_ml-0.2.0/tests/test_retrieval.py +243 -0
  121. openadapt_ml-0.2.0/tests/test_training_dummy.py +90 -0
  122. openadapt_ml-0.2.0/tests/test_trl_trainer.py +617 -0
  123. openadapt_ml-0.2.0/tests/test_waa_demo.py +358 -0
  124. openadapt_ml-0.2.0/tests/test_waa_live.py +314 -0
  125. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/uv.lock +747 -3
  126. openadapt_ml-0.1.0/examples/train_from_json.py +0 -153
  127. openadapt_ml-0.1.0/openadapt_ml/benchmarks/cli.py +0 -884
  128. openadapt_ml-0.1.0/openadapt_ml/cloud/local.py +0 -790
  129. openadapt_ml-0.1.0/openadapt_ml/ingest/loader.py +0 -232
  130. openadapt_ml-0.1.0/openadapt_ml/schemas/__init__.py +0 -53
  131. openadapt_ml-0.1.0/openadapt_ml/schemas/sessions.py +0 -122
  132. openadapt_ml-0.1.0/openadapt_ml/schemas/validation.py +0 -252
  133. openadapt_ml-0.1.0/openadapt_ml/scripts/prepare_synthetic.py +0 -43
  134. openadapt_ml-0.1.0/openadapt_ml/training/benchmark_viewer.py +0 -1538
  135. openadapt_ml-0.1.0/tests/test_training_dummy.py +0 -26
  136. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/.env.example +0 -0
  137. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/.github/workflows/publish.yml +0 -0
  138. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/.gitmodules +0 -0
  139. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/.python-version +0 -0
  140. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/LICENSE +0 -0
  141. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/configs/qwen2_5vl_synthetic.yaml +0 -0
  142. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/configs/qwen3vl_capture.yaml +0 -0
  143. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/configs/qwen3vl_capture_4bit.yaml +0 -0
  144. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/configs/qwen3vl_capture_batched.yaml +0 -0
  145. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/configs/qwen3vl_synthetic.yaml +0 -0
  146. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/configs/qwen3vl_synthetic_coord_v2.yaml +0 -0
  147. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/configs/qwen3vl_synthetic_dev.yaml +0 -0
  148. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/configs/qwen3vl_synthetic_registration_som.yaml +0 -0
  149. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/configs/qwen3vl_synthetic_som.yaml +0 -0
  150. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/docs/NEXT_STEPS_GROUNDING_ARCHITECTURE.md +0 -0
  151. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/docs/PRIVACY_IMPLEMENTATION_PLAN.md +0 -0
  152. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/docs/RECORD_IMPLEMENTATION_PLAN.md +0 -0
  153. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/docs/auto_shutoff_design.md +0 -0
  154. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/docs/azure_acr_authentication.md +0 -0
  155. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/docs/batching_and_schedulers.md +0 -0
  156. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/docs/benchmark_integration_plan.md +0 -0
  157. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/docs/benchmark_next_steps.md +0 -0
  158. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/docs/benchmark_viewer_phase2.md +0 -0
  159. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/docs/dashboard_architecture.md +0 -0
  160. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/docs/design.md +0 -0
  161. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/docs/early_termination.md +0 -0
  162. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/docs/eval_json_schema.md +0 -0
  163. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/docs/gui_actor_integration.md +0 -0
  164. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/docs/images/dashboard/training_bottom.png +0 -0
  165. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/docs/images/dashboard/training_top.png +0 -0
  166. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/docs/images/dashboard/viewer_bottom.png +0 -0
  167. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/docs/images/dashboard/viewer_top.png +0 -0
  168. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/docs/images/grounding_demo.png +0 -0
  169. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/docs/images/grounding_demo_full.png +0 -0
  170. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/docs/images/training-dashboard.png +0 -0
  171. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/docs/images/viewer-comparison.png +0 -0
  172. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/docs/opencua_integration.md +0 -0
  173. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/docs/output_artifacts_and_media.md +0 -0
  174. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/docs/prediction_loading_architecture.md +0 -0
  175. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/docs/qwen_login_experiment.md +0 -0
  176. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/docs/roadmap.md +0 -0
  177. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/docs/set_of_marks_implementation.md +0 -0
  178. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/docs/som_implementation_verification.md +0 -0
  179. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/docs/state_and_next_steps_qwen_login.md +0 -0
  180. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/docs/stub_training_adapter.md +0 -0
  181. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/docs/synthetic_login_jitter_and_ablation.md +0 -0
  182. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/docs/training_feedback_ux.md +0 -0
  183. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/docs/unified_compute_architecture.md +0 -0
  184. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/docs/viewer_eval_integration.md +0 -0
  185. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/docs/viewer_layout_redesign.md +0 -0
  186. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/docs/vision.md +0 -0
  187. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/docs/wandb_integration.md +0 -0
  188. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/examples/README.md +0 -0
  189. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/examples/sample_data.json +0 -0
  190. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/examples/test_gemini_grounding.py +0 -0
  191. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/experiments/qwen_login/2b_dev/media/qwen3_2b_login_demo.gif +0 -0
  192. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/experiments/qwen_login/2b_dev/media/qwen3_2b_login_demo_session_0001.gif +0 -0
  193. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/experiments/qwen_login/2b_dev/plots/base_vs_ft.png +0 -0
  194. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/experiments/qwen_login/2b_dev/plots/qwen3_2b_base_vs_ft_hardened_v2.png +0 -0
  195. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/experiments/qwen_login/2b_dev/plots/qwen_vs_apis.png +0 -0
  196. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/experiments/qwen_login/8b_hero/plots/qwen3_8b_base_vs_ft_hardened_v2.png +0 -0
  197. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/experiments/qwen_login/SOM_INVESTIGATION_REPORT.md +0 -0
  198. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/experiments/qwen_login/comprehensive_comparison.png +0 -0
  199. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/experiments/qwen_login/login_demo.gif +0 -0
  200. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/experiments/qwen_login/registration_demo.gif +0 -0
  201. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/openadapt_ml/__init__.py +0 -0
  202. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/openadapt_ml/benchmarks/base.py +0 -0
  203. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/openadapt_ml/benchmarks/data_collection.py +0 -0
  204. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/openadapt_ml/cloud/__init__.py +0 -0
  205. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/openadapt_ml/cloud/azure_inference.py +0 -0
  206. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/openadapt_ml/cloud/lambda_labs.py +0 -0
  207. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/openadapt_ml/config.py +0 -0
  208. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/openadapt_ml/datasets/__init__.py +0 -0
  209. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/openadapt_ml/evals/__init__.py +0 -0
  210. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/openadapt_ml/evals/plot_eval_metrics.py +0 -0
  211. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/openadapt_ml/grounding/__init__.py +0 -0
  212. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/openadapt_ml/grounding/base.py +0 -0
  213. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/openadapt_ml/grounding/detector.py +0 -0
  214. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/openadapt_ml/models/__init__.py +0 -0
  215. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/openadapt_ml/models/api_adapter.py +0 -0
  216. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/openadapt_ml/models/base_adapter.py +0 -0
  217. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/openadapt_ml/models/dummy_adapter.py +0 -0
  218. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/openadapt_ml/models/qwen_vl.py +0 -0
  219. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/openadapt_ml/runtime/__init__.py +0 -0
  220. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/openadapt_ml/scripts/__init__.py +0 -0
  221. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/openadapt_ml/scripts/demo_policy.py +0 -0
  222. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/openadapt_ml/scripts/make_gif.py +0 -0
  223. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/openadapt_ml/scripts/run_qwen_login_benchmark.py +0 -0
  224. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/openadapt_ml/training/__init__.py +0 -0
  225. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/openadapt_ml/training/shared_ui.py +0 -0
  226. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/openadapt_ml/training/stub_provider.py +0 -0
  227. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/openadapt_ml/training/viewer.py +0 -0
  228. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/scripts/fix_acr_auth.py +0 -0
  229. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/scripts/setup_azure.py +0 -0
  230. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/tests/__init__.py +0 -0
  231. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/tests/benchmarks/__init__.py +0 -0
  232. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/tests/benchmarks/test_api_agent.py +0 -0
  233. {openadapt_ml-0.1.0 → openadapt_ml-0.2.0}/tests/benchmarks/test_waa.py +0 -0
@@ -39,3 +39,4 @@ debug_*/
 
  # Internal documentation (not for public repo)
  docs/internal/
+ docs/internal/
@@ -2,6 +2,108 @@
 
  This file helps maintain context across sessions.
 
+ ---
+ ## ⚠️⚠️⚠️ MANDATORY: START DASHBOARD FIRST ⚠️⚠️⚠️
+
+ ### STOP. READ THIS BEFORE DOING ANYTHING.
+
+ **If ANY of these are true, you MUST run the dashboard command IMMEDIATELY:**
+ - Session just started or was compacted
+ - User mentions VMs, Azure, WAA, benchmark, or Windows
+ - You're about to run ANY `vm` subcommand (probe, diag, logs, run-waa, etc.)
+ - You want to check benchmark status
+
+ **THE COMMAND (run this FIRST, not after other commands):**
+ ```bash
+ uv run python -m openadapt_ml.benchmarks.cli vm monitor
+ ```
+
+ **WHY THIS MATTERS:**
+ - VNC is ONLY accessible via SSH tunnel at `localhost:8006` (NOT the public IP)
+ - The dashboard auto-manages SSH tunnels
+ - Without it, you cannot see what Windows is doing
+ - The user WILL be frustrated if you keep forgetting this
+
+ **WRONG (what you keep doing):**
+ ```bash
+ # DON'T do this - checking probe/diag/logs WITHOUT dashboard running
+ uv run python -m openadapt_ml.benchmarks.cli vm probe
+ uv run python -m openadapt_ml.benchmarks.cli vm diag
+ # Then telling user to "run vm monitor" - NO! YOU run it FIRST!
+ ```
+
+ **RIGHT (what you should do):**
+ ```bash
+ # ALWAYS start dashboard FIRST, then it handles everything
+ uv run python -m openadapt_ml.benchmarks.cli vm monitor
+ ```
+
+ **After every /compact or session restart, your LITERAL FIRST ACTION must be starting this dashboard if VMs are involved.**
+
+ ---
+ ## 🚨🚨🚨 STOP! READ THIS BEFORE EVERY COMMAND 🚨🚨🚨
+
+ ### ABSOLUTELY NEVER USE RAW SSH COMMANDS
+
+ **This is the #1 rule. You have been told this MANY times. STOP IGNORING IT.**
+
+ ❌ **BANNED** (never type these):
+ - `ssh azureuser@IP "anything"`
+ - `ssh $SSH_OPTS ...`
+ - Any command starting with `ssh` to the VM
+
+ ✅ **REQUIRED** (always use these instead):
+ - `uv run python -m openadapt_ml.benchmarks.cli vm exec --cmd "your command"`
+ - `uv run python -m openadapt_ml.benchmarks.cli vm diag`
+ - `uv run python -m openadapt_ml.benchmarks.cli vm logs`
+
+ **If a CLI command doesn't exist, ADD IT TO THE CLI FIRST, then use it.**
+
+ **Before running ANY command involving the VM, ask yourself:**
+ 1. Does this start with `ssh`? → STOP, use CLI instead
+ 2. Is this a raw shell command to the VM? → STOP, use CLI instead
+ 3. Can I use `vm exec --cmd`? → YES, use it
+
+ This has been explained to you repeatedly. FOLLOW IT.
+
+ ---
+ ## 🔧 DOCKERFILE/VM CHANGES: TEST INSIDE CONTAINER FIRST
+
+ **Problem**: Each Dockerfile change triggers: rebuild (10 min) → Windows boot (15 min) → test → repeat. Hours wasted on tiny changes.
+
+ **Solution**: Test fixes INSIDE a running container BEFORE rebuilding:
+
+ ```bash
+ # 1. Start a test container with bash entrypoint (seconds)
+ uv run python -m openadapt_ml.benchmarks.cli vm host-exec --cmd \
+   'docker run -d --name test-fix --entrypoint /bin/bash waa-auto:latest -c "sleep 3600"'
+
+ # 2. Apply your fix manually INSIDE the container (seconds)
+ uv run python -m openadapt_ml.benchmarks.cli vm host-exec --cmd \
+   "docker exec test-fix sed -i 's/old/new/' /some/file.sh"
+
+ # 3. Verify the fix works (seconds)
+ uv run python -m openadapt_ml.benchmarks.cli vm host-exec --cmd \
+   "docker exec test-fix cat /some/file.sh"
+
+ # 4. Test the actual behavior (seconds)
+ uv run python -m openadapt_ml.benchmarks.cli vm host-exec --cmd \
+   "docker exec test-fix /some/script.sh && ls /expected/output"
+
+ # 5. Cleanup
+ uv run python -m openadapt_ml.benchmarks.cli vm host-exec --cmd 'docker rm -f test-fix'
+
+ # 6. ONLY AFTER fix is verified: Update Dockerfile and rebuild ONCE
+ ```
+
+ **Why this matters**:
+ - Testing a fix takes SECONDS instead of 30+ minutes
+ - Iterate 10x on the fix before committing to a rebuild
+ - Don't lose context waiting for long builds
+ - Each rebuild should be the LAST rebuild, not a guess
+
+ ---
+
  ## Project Overview
 
  openadapt-ml is a model-agnostic, domain-agnostic ML engine for GUI automation agents. It provides:
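The container-first testing loop added above is mechanical enough to script. Below is a minimal sketch that drives the same documented `vm host-exec` CLI invocations from Python via `subprocess`; the shell commands are taken verbatim from the hunk, while the wrapper function itself is illustrative and not part of the package.

```python
import shlex
import subprocess

# Documented CLI entry point (from the hunk above).
CLI = "uv run python -m openadapt_ml.benchmarks.cli"

def host_exec(cmd: str) -> str:
    """Run a command on the VM's Docker host via the documented `vm host-exec`."""
    result = subprocess.run(
        f"{CLI} vm host-exec --cmd {shlex.quote(cmd)}",
        shell=True, capture_output=True, text=True, check=True,
    )
    return result.stdout

# 1. Start a throwaway container with a bash entrypoint (seconds, no rebuild).
host_exec('docker run -d --name test-fix --entrypoint /bin/bash waa-auto:latest -c "sleep 3600"')
try:
    # 2. Apply the candidate fix inside the running container.
    host_exec("docker exec test-fix sed -i 's/old/new/' /some/file.sh")
    # 3. Verify it before committing to a 30+ minute rebuild.
    print(host_exec("docker exec test-fix cat /some/file.sh"))
finally:
    # 4. Always remove the test container.
    host_exec("docker rm -f test-fix")
```

Wrapping the loop this way also guarantees the cleanup step runs even when a verification step fails.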
@@ -11,7 +113,18 @@ openadapt-ml is a model-agnostic, domain-agnostic ML engine for GUI automation a
  - Supervised fine-tuning pipeline
  - Runtime policy API
 
- ## Current Focus: Benchmark Integration
+ ## Current Focus: Demo Retrieval
+
+ **Validated**: Demo-conditioned prompting improves action accuracy (Dec 2024)
+ - Zero-shot: 33% correct first actions
+ - With demo: 100% correct first actions
+ - See `docs/experiments/demo_conditioned_prompting_results.md`
+
+ **Next step**: Build demo retrieval to automatically select relevant demos from a library.
+
+ **Key insight**: OpenAdapt's value is **trajectory-conditioned disambiguation of UI affordances**, not "better reasoning".
+
+ ## Benchmark Integration
 
  **Primary benchmark**: Windows Agent Arena (WAA)
  - 154 tasks across 11 Windows domains
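For readers skimming the diff: "demo-conditioned prompting" here means prepending a recorded trajectory to the task prompt. The package's real formatter lives in `openadapt_ml/experiments/demo_prompt/format_demo.py` (see the file list); the sketch below only illustrates the idea, and every name in it is hypothetical.

```python
from dataclasses import dataclass

@dataclass
class DemoStep:
    """One recorded step: what the screen showed and what the user did."""
    observation: str  # e.g. "Login form with Username and Password fields"
    action: str       # e.g. 'click(element="Username field")'

def build_demo_conditioned_prompt(task: str, demo: list[DemoStep]) -> str:
    """Prepend a recorded demonstration so the model disambiguates UI
    affordances by following the trajectory, not by open-ended reasoning."""
    lines = ["Here is a demonstration of a similar task:"]
    for i, step in enumerate(demo, 1):
        lines.append(f"Step {i}: [{step.observation}] -> {step.action}")
    lines.append(f"\nNow perform this task the same way: {task}")
    return "\n".join(lines)

demo = [
    DemoStep("Login page", 'click(element="Username field")'),
    DemoStep("Username field focused", 'type(text="alice")'),
]
print(build_demo_conditioned_prompt("Log in as bob", demo))
```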
@@ -289,10 +402,13 @@ uv run python -m openadapt_ml.cloud.local serve --open
  - `docs/benchmark_integration_plan.md` - Benchmark integration architecture
  - `docs/azure_waa_setup.md` - Azure WAA setup guide (quota increase, costs, troubleshooting)
  - `docs/design.md` - Overall system design
+ - `docs/experiments/demo_conditioned_prompting_results.md` - Demo experiment results (validated Dec 2024)
  - `openadapt_ml/cloud/` - Cloud GPU providers (Lambda Labs, Azure)
  - `openadapt_ml/benchmarks/` - Benchmark integration module (WAA, base classes)
+ - `openadapt_ml/experiments/demo_prompt/` - Demo-conditioned prompting experiment
  - `openadapt_ml/grounding/` - Grounding module (GeminiGrounder, etc.)
  - `openadapt_ml/ingest/capture.py` - Converts openadapt-capture recordings to Episodes
+ - `scripts/run_demo_experiment.py` - Run demo-conditioned experiment
  - `configs/qwen3vl_synthetic_som.yaml` - SoM training config
 
  ## Code Patterns
@@ -341,13 +457,94 @@ The training dashboard and capture viewer share UI components for visual consist
  - Single source of truth for styling (no duplicate CSS to maintain)
  - Easier to add new dashboards that match existing style
 
+ ## CRITICAL: Always Start Dashboard When Running Azure Resources
+
+ See the ⚠️ MANDATORY section at the TOP of this file. Use:
+ ```bash
+ uv run python -m openadapt_ml.benchmarks.cli vm monitor
+ ```
+
+ ## ⚠️ SAFE PROCESS MANAGEMENT ⚠️
+
+ **NEVER use broad pkill patterns** - they can kill unrelated applications!
+
+ **WRONG (DANGEROUS):**
+ ```bash
+ # These patterns are TOO BROAD and will kill unrelated apps:
+ pkill -f "openadapt"        # Kills anything with "openadapt" in path
+ pkill -f "python"           # Kills ALL Python processes
+ pkill -9 -f "openadapt_ml"  # Killed Claude Code, Windsurf, Signal, Chrome tabs!
+ ```
+
+ **RIGHT (SAFE):**
+ ```bash
+ # Use specific PID-based killing:
+ lsof -i :8765 | grep python | awk '{print $2}' | xargs kill 2>/dev/null
+
+ # Or use specific process names with full path matching:
+ pkill -f "python.*-m openadapt_ml.cloud.local serve"
+
+ # Or kill only the specific port listener:
+ kill $(lsof -t -i :8765) 2>/dev/null
+
+ # Check what would be killed FIRST:
+ pgrep -f "openadapt" -l  # Lists matching processes before killing
+ ```
+
+ **Before any pkill command:**
+ 1. Run `pgrep -f "pattern" -l` to see what matches
+ 2. Verify only intended processes are listed
+ 3. Use the most specific pattern possible
+ 4. Prefer port-based or PID-based killing
+
  ## Don't Do
 
  - Don't add timelines/estimates to plans
  - Don't mention specific clients by name in public docs
  - Don't over-engineer - keep solutions minimal
  - Don't use `os.environ` directly - use `config.settings` instead
- - Don't use `pip install` - always use `uv pip install` or `uv add` for consistency
+ - Don't use `pip install` - always use `uv add` for dependencies or `uv sync` for the project
+ - **Don't run Azure/VM operations without starting the dashboard first**
+   - ❌ WRONG: `vm probe` then `vm diag` then telling user to run `vm monitor`
+   - ✅ RIGHT: `vm monitor` FIRST (it does probe, tunnels, everything)
+   - This is the #1 mistake you keep making. STOP IT.
+ - **Don't use raw SSH/shell commands** - always use or create CLI commands instead (see below)
+ - **Don't tell user to run commands** - YOU run them. The CLI exists so YOU can use it.
+
+ ## CLI-First Development (IMPORTANT)
+
+ **ALWAYS** use CLI commands instead of raw SSH/shell commands:
+ - ✅ `uv run python -m openadapt_ml.benchmarks.cli vm diag` (not `ssh ... df -h`)
+ - ✅ `uv run python -m openadapt_ml.benchmarks.cli vm logs` (not `ssh ... docker logs`)
+ - ✅ `uv run python -m openadapt_ml.benchmarks.cli vm probe` (not `ssh ... curl`)
+
+ **Why**: CLI commands are documented, tested, and persist across context compactions. Raw commands are forgotten.
+
+ **When you need a new operation**:
+ 1. Add a new action to the relevant CLI subcommand (e.g., `vm logs`, `vm exec`)
+ 2. Document it in CLAUDE.md
+ 3. Use the CLI command going forward
+
+ **Available VM CLI commands**:
+ ```bash
+ vm monitor       # THE GO-TO COMMAND: Start dashboard, open browser, show probe status
+                  # Options: --auto-shutdown-hours N (deallocate after N hours)
+ vm setup-waa     # Full VM setup with Docker and waa-auto image
+ vm run-waa       # Run benchmark (requires waa-auto image, --rebuild to force image rebuild)
+ vm diag          # Check disk, Docker, containers, WAA probe status
+ vm logs          # View container logs (--lines N, --follow)
+ vm probe         # Check WAA server status (--wait to poll)
+ vm exec          # Run command in container (--cmd 'your command')
+ vm fix-oem       # Copy OEM files to Samba share (for manual install.bat)
+ vm docker-prune  # Clean Docker images, containers, build cache (free disk space)
+ vm docker-move   # Move Docker/containerd to /mnt via symlinks (147GB space)
+ vm stop-build    # Stop running Docker build and clean build cache
+ vm status        # Azure VM status
+ vm ssh           # Interactive SSH
+ vm deallocate    # Stop VM billing (preserves disk), use -y to skip confirmation
+ vm start         # Start a deallocated VM
+ vm delete        # Delete VM (use -y to skip confirmation)
+ ```
 
  ## TODO / Known Issues
 
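The safe-process-management rules above translate naturally into a check-before-kill helper. A stdlib-only sketch mirroring the `lsof` guidance from the hunk (the helper is illustrative, not packaged code):

```python
import os
import signal
import subprocess

def pids_listening_on(port: int) -> list[int]:
    """PIDs bound to a port, via `lsof -t -i :PORT` (same tool the notes use)."""
    out = subprocess.run(
        ["lsof", "-t", "-i", f":{port}"], capture_output=True, text=True
    )
    return [int(pid) for pid in out.stdout.split()]

def kill_port_listener(port: int, dry_run: bool = True) -> None:
    """Port-scoped alternative to broad pkill: always list before killing."""
    for pid in pids_listening_on(port):
        # Show what would be killed before doing anything destructive.
        name = subprocess.run(
            ["ps", "-p", str(pid), "-o", "comm="], capture_output=True, text=True
        ).stdout.strip()
        print(f"{'would kill' if dry_run else 'killing'} PID {pid} ({name})")
        if not dry_run:
            os.kill(pid, signal.SIGTERM)

kill_port_listener(8765)                  # step 1: inspect the matches
# kill_port_listener(8765, dry_run=False) # step 2: kill only what you saw
```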
@@ -406,6 +603,144 @@ az ml workspace sync-keys -n openadapt-ml -g openadapt-agents
  - [Azure ML Managed Identity ACR Authentication](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-identity-based-service-authentication)
  - [ACR Pull Role Assignment](https://learn.microsoft.com/en-us/azure/container-registry/container-registry-authentication-managed-identity)
 
+ ### Azure WAA Evaluation - Dedicated VM Setup
+ **Status**: WORKING - Custom `waa-auto` Docker image REQUIRED (verified Jan 2026)
+
+ **Problem**: WAA requires running a Windows VM inside Docker (via QEMU). Azure ML managed compute doesn't support nested virtualization.
+
+ **CRITICAL**: The official `windowsarena/winarena:latest` image is **BROKEN**. It uses an outdated `dockurr/windows v0.00` that does NOT auto-download Windows 11. You will get "ISO file not found" errors and the VM will never start.
+
+ **Solution**: The CLI builds a custom `waa-auto` Docker image that:
+ 1. Uses modern `dockurr/windows:latest` (v5.14+) which auto-downloads Windows 11
+ 2. Installs Python 3 and all WAA client dependencies
+ 3. Patches IP addresses for dockurr/windows networking
+
+ **Working Quick Start** (via CLI - fully automated):
+ ```bash
+ # 1. Setup VM with Docker and build waa-auto image (~10 min)
+ uv run python -m openadapt_ml.benchmarks.cli vm setup-waa --api-key $OPENAI_API_KEY
+
+ # 2. Run benchmark (Windows downloads on first run, ~15 min, then ~30 min/20 tasks)
+ uv run python -m openadapt_ml.benchmarks.cli vm run-waa --num-tasks 20
+
+ # 3. Delete VM when done (IMPORTANT: stops billing!)
+ uv run python -m openadapt_ml.benchmarks.cli vm delete
+ ```
+
+ **Diagnostic commands**:
+ ```bash
+ # Check VM disk, Docker, containers, WAA probe status
+ uv run python -m openadapt_ml.benchmarks.cli vm diag
+
+ # Check VM Azure status
+ uv run python -m openadapt_ml.benchmarks.cli vm status
+
+ # SSH into VM for debugging
+ uv run python -m openadapt_ml.benchmarks.cli vm ssh
+
+ # Check if WAA server is ready
+ uv run python -m openadapt_ml.benchmarks.cli vm probe --wait
+
+ # Force rebuild waa-auto if needed
+ uv run python -m openadapt_ml.benchmarks.cli vm run-waa --rebuild --num-tasks 5
+ ```
+
+ **What the CLI does** (via custom `waa-auto` Docker image in `openadapt_ml/benchmarks/waa/Dockerfile`):
+ 1. Uses modern `dockurr/windows:latest` base (auto-downloads Windows 11)
+ 2. Copies `/oem` folder from official WAA image (fixes OEM folder issue)
+ 3. Patches IP addresses (20.20.20.21 → 172.30.0.2)
+ 4. Adds automation commands to Windows FirstLogonCommands:
+    - Disable firewall, sleep, lock screen
+    - **Auto-runs install.bat** to install Python, Chrome, LibreOffice, VSCode, WAA server
+ 5. Installs Python dependencies for benchmark client
+
+ **Fully automated** - no manual VNC login or script execution needed!
+
+ **Key requirements**:
+ 1. **VM Size**: `Standard_D4ds_v5` or larger (nested virtualization required)
+ 2. **Docker storage**: Scripts use `/mnt/WindowsAgentArena/src/win-arena-container/vm/storage`
+ 3. **ISO location**: `src/win-arena-container/vm/image/setup.iso`
+ 4. **API key**: `config.json` in repo root with OPENAI_API_KEY
+ 5. **Valid model name**: Must use real OpenAI model (e.g., `gpt-4o`, `gpt-4o-mini`). Invalid names cause benchmark to hang on API retries.
+
+ **Architecture**:
+ ```
+ Azure VM (Standard_D4ds_v5, nested virt enabled)
+ └── Docker (data on /mnt)
+     └── winarena:latest (built by run-local.sh)
+         └── QEMU running Windows 11 VM (IP: 20.20.20.21)
+             └── WAA Flask server on port 5000
+                 └── Navi agent executing tasks
+ ```
+
+ **Monitor progress**:
+ - VNC: `http://localhost:8006` (via SSH tunnel, auto-managed by dashboard)
+ - Logs: `tail -f /tmp/waa_benchmark.log` (if running via nohup)
+
+ **Files**:
+ - `openadapt_ml/benchmarks/cli.py` - `vm` subcommand with setup-waa, probe
+ - `openadapt_ml/cloud/ssh_tunnel.py` - SSH tunnel manager (auto VNC/WAA tunnels)
+ - `docs/waa_setup.md` - Detailed setup guide
+
+ ### SSH Tunnel Management (VNC/WAA Access)
+ **Status**: DONE
+
+ **Problem**: Azure VMs have Network Security Groups (NSGs) that only expose port 22 (SSH) by default. Ports 8006 (VNC) and 5000 (WAA) are not accessible directly.
+
+ **Solution**: Automatic SSH tunnel management via `SSHTunnelManager`:
+
+ ```
+ Browser → localhost:8006 → SSH Tunnel → Azure VM:8006 → Docker → noVNC
+ Browser → localhost:5000 → SSH Tunnel → Azure VM:5000 → WAA Flask
+ ```
+
+ **Architecture**:
+ 1. When VM's WAA probe becomes "ready", tunnels auto-start
+ 2. When VM goes offline, tunnels auto-stop
+ 3. Dashboard shows tunnel status next to VNC button
+ 4. VNC button links to localhost:port (tunnel endpoint)
+
+ **Files**:
+ - `openadapt_ml/cloud/ssh_tunnel.py` - SSHTunnelManager class
+ - `openadapt_ml/cloud/local.py` - Integration with dashboard server
+ - `openadapt_ml/training/benchmark_viewer.py` - UI showing tunnel status
+
+ **API Endpoints**:
+ - `GET /api/tunnels` - Returns tunnel status for VNC and WAA
+ - `GET /api/vms` - Includes `tunnels` field with per-tunnel status
+
+ **Key features**:
+ - Auto-start on VM online (idempotent - safe to call repeatedly)
+ - Auto-stop on VM offline
+ - Port conflict detection
+ - Graceful shutdown on process exit
+ - No manual SSH commands needed
+
+ **Manual usage** (if needed):
+ ```python
+ from openadapt_ml.cloud.ssh_tunnel import get_tunnel_manager
+
+ manager = get_tunnel_manager()
+ manager.start_tunnels_for_vm("172.171.112.41", "azureuser")
+ status = manager.get_tunnel_status()
+ manager.stop_all_tunnels()
+ ```
+
+ **Why not open NSG ports?**
+ 1. VNC has no authentication by default - anyone can connect
+ 2. SSH tunnel encrypts all traffic
+ 3. Requires SSH key auth - no password guessing
+ 4. No Azure NSG changes needed
+
+ **Alternative: Mock evaluation** for testing without Windows:
+ ```bash
+ uv run python -m openadapt_ml.benchmarks.cli test-mock --tasks 20
+ ```
+
+ **References**:
+ - [Windows Agent Arena GitHub](https://github.com/microsoft/WindowsAgentArena)
+ - [Azure nested virtualization](https://learn.microsoft.com/en-us/azure/virtual-machines/acu)
+
  ### Training Dashboard - Terminal Output Streaming
  **Status**: DONE
 
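For orientation, the forwards that `SSHTunnelManager` automates are plain OpenSSH local port forwards. A stdlib-only sketch of the equivalent manual tunnel, useful when debugging the manager itself (the host IP and ports mirror the examples in the hunk; the helper is illustrative, not part of the package):

```python
import subprocess

def open_tunnels(vm_ip: str, user: str = "azureuser",
                 ports: tuple[int, ...] = (8006, 5000)) -> subprocess.Popen:
    """Forward localhost:8006 (noVNC) and localhost:5000 (WAA Flask) over SSH.

    Equivalent to what SSHTunnelManager sets up automatically.
    """
    args = ["ssh", "-N"]  # -N: forward ports only, run no remote command
    for port in ports:
        args += ["-L", f"{port}:localhost:{port}"]
    args.append(f"{user}@{vm_ip}")
    return subprocess.Popen(args)

tunnel = open_tunnels("172.171.112.41")  # IP reused from the example above
# ...browse http://localhost:8006 for VNC, http://localhost:5000 for WAA...
tunnel.terminate()  # graceful shutdown, as the manager does on process exit
```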
@@ -522,7 +857,7 @@ Verified:
  - Backend flag options: `claude`, `openai` in CLI ✓
 
  ### Benchmark Viewer Integration
- **Status**: Phase 1 DONE, Phases 2-4 TODO
+ **Status**: Phases 1-3 DONE, Phase 4 TODO
 
  **Goal**: Integrate benchmark evaluation results (WAA, WebArena, OSWorld) into the unified viewer.
 
@@ -532,7 +867,7 @@
  1. **Benchmarks tab**: Third tab alongside Training and Viewer
  2. **Task-level view**: List of benchmark tasks with pass/fail status
  3. **Step-by-step replay**: Same UI as Viewer tab for benchmark executions
- 4. **Model comparison**: Side-by-side comparison of different models on same task
+ 4. **Model comparison**: Side-by-side comparison of different models on same task (TODO)
  5. **Aggregate metrics**: Success rate by domain, difficulty rankings
 
  **Implementation phases**:
@@ -543,22 +878,30 @@ Verified:
  - Directory structure: `benchmark_results/{run_name}/tasks/{task_id}/`
  - Each task has: `task.json`, `execution.json`, `screenshots/`
  - Test script: `test_data_collection.py` validates all files are created
- 2. **Viewer backend** (TODO): `generate_benchmark_viewer()` function
- 3. **UI components** (TODO): Summary dashboard, task list, replay
+ 2. **Viewer backend** (DONE): `generate_benchmark_viewer()` function
+    - Created `openadapt_ml/benchmarks/viewer.py` with viewer generation
+    - Added CLI command: `uv run python -m openadapt_ml.benchmarks.cli view --run-name {name}`
+    - Generates standalone HTML with same styling as training viewer
+    - Uses shared header components via `shared_ui.py`
+ 3. ✅ **UI components** (DONE - Basic): Summary dashboard, task list, replay
+    - Summary panel with total tasks, passed/failed, success rate
+    - Domain breakdown with per-domain statistics
+    - Filter controls (domain, status)
+    - Task list with status badges
+    - Step-by-step viewer with screenshots, actions, reasoning
+    - Playback controls (prev/next, play/pause, speed)
+    - Keyboard shortcuts (Space, arrows, Home/End)
  4. **Analysis** (TODO): Failure clustering, regression detection
 
- **Phase 1 verification:**
+ **View benchmark results:**
  ```bash
- # Test data collection
- uv run python -m openadapt_ml.benchmarks.cli test-collection --tasks 5
+ # Generate HTML viewer and serve it
+ uv run python -m openadapt_ml.benchmarks.cli view --run-name {name}
 
- # Verify output
- ls -la benchmark_results/{run_name}/tasks/task_001/
- # Should contain: task.json, execution.json, screenshots/
-
- # Check JSON structure
- cat benchmark_results/{run_name}/summary.json
- cat benchmark_results/{run_name}/tasks/task_001/execution.json
+ # Options:
+ #   --embed-screenshots   Embed screenshots as base64 (standalone HTML)
+ #   --no-open             Don't auto-open browser
+ #   --port 9000           Use custom port
  ```
 
  ## Preventing Stale Data Issues
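The on-disk layout documented above (`benchmark_results/{run_name}/tasks/{task_id}/` with `task.json` and `execution.json`) also supports quick ad-hoc analysis without the HTML viewer. A sketch, with the caveat that the `success` field name is an assumption to verify against a real `execution.json`:

```python
import json
from pathlib import Path

def summarize_run(run_dir: Path) -> None:
    """Tally pass/fail from benchmark_results/{run_name}/tasks/*/execution.json."""
    results = []
    for exec_file in sorted(run_dir.glob("tasks/*/execution.json")):
        data = json.loads(exec_file.read_text())
        # NOTE: "success" is a guessed key; confirm against a real execution.json.
        results.append((exec_file.parent.name, bool(data.get("success"))))
    passed = sum(ok for _, ok in results)
    print(f"{passed}/{len(results)} tasks passed")
    for task_id, ok in results:
        print(f"  {'PASS' if ok else 'FAIL'}  {task_id}")

summarize_run(Path("benchmark_results/my_run"))  # run name is a placeholder
```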
@@ -618,3 +961,23 @@ The viewer should automatically load:
  | Predictions not extracted | HTML uses `window.comparisonData` but regex expects `const` | Use regex `(?:const\s+\|window\.)comparisonData` pattern |
  | Stale data after code change | Browser caching HTML | Hard refresh (Cmd+Shift+R) or disable cache |
  | Screenshots 404 | Screenshot symlink broken | Recreate: `ln -sf /path/to/capture/screenshots training_output/current/screenshots` |
+
+ ### UI/Display Guidelines
+
+ **Placeholder data must be clearly marked** when displaying values that may not reflect actual data:
+ - If task counts, worker counts, etc. come from local tracking (not synced with Azure), mark them with an asterisk: "3* tasks • 1* worker(s)"
+ - Add a footnote: "[*: placeholder, actual values may differ]"
+ - This applies to any data that is locally cached but not confirmed from the authoritative source
+
+ ### Azure ML Integration Notes
+
+ **Experiment ID**: The Azure ML experiments page URL requires an experiment ID which is workspace-specific:
+ - Current hardcoded ID: `ad29082c-0607-4fda-8cc7-38944eb5a518`
+ - **TODO**: Retrieve experiment_id dynamically from Azure using `az ml experiment list`
+ - The experiment name is `openadapt-ml` but the URL requires the UUID format
+
+ **Azure ML URL format**:
+ - Jobs list: `https://ml.azure.com/experiments/id/{experiment_id}?wsid={workspace_id}`
+ - Specific job: `https://ml.azure.com/experiments/id/{experiment_id}/runs/{run_id}?wsid={workspace_id}`
+
+ **WAA Docker command**: Use `python run.py` not `python -m client.run` (the client directory is not a Python package)
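The Azure ML URL templates above are easy to get wrong by hand; a tiny helper keeps them in one place. A sketch using only the documented formats (the workspace and run IDs are placeholders; only the experiment UUID comes from the notes above):

```python
BASE = "https://ml.azure.com/experiments/id"

def jobs_list_url(experiment_id: str, workspace_id: str) -> str:
    """Jobs-list URL, per the documented format."""
    return f"{BASE}/{experiment_id}?wsid={workspace_id}"

def job_url(experiment_id: str, run_id: str, workspace_id: str) -> str:
    """Deep link to a specific run, per the documented format."""
    return f"{BASE}/{experiment_id}/runs/{run_id}?wsid={workspace_id}"

# Experiment UUID is the hardcoded one from the notes; the rest are placeholders.
print(jobs_list_url("ad29082c-0607-4fda-8cc7-38944eb5a518", "<workspace_id>"))
print(job_url("ad29082c-0607-4fda-8cc7-38944eb5a518", "<run_id>", "<workspace_id>"))
```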