PyPI - polyharness - Versions diffs - 0.2.0__tar.gz → 0.2.2__tar.gz - Mend

polyharness: 0.2.0 (tar.gz) → 0.2.2 (tar.gz)

Files changed (70)

{polyharness-0.2.0/src/polyharness.egg-info → polyharness-0.2.2}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: polyharness
-Version: 0.2.0
+Version: 0.2.2
 Summary: Automated harness optimization for AI agents — make your agent evolve.
 Author: weijt606
 License-Expression: MIT
@@ -48,7 +48,7 @@ Dynamic: license-file
 [![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](LICENSE)
 [![Python 3.12+](https://img.shields.io/badge/python-3.12+-blue.svg)](https://www.python.org/downloads/)
-[![Tests](https://img.shields.io/badge/tests-165%20passing-brightgreen.svg)]()
+[![Tests](https://img.shields.io/badge/tests-206%20passing-brightgreen.svg)]()
 [![中文文档](https://img.shields.io/badge/文档-中文版-red.svg)](README_CN.md)
 ---
@@ -63,7 +63,7 @@ Your AI agent runs the same harness every time. Same prompts, same tool config,
 | | |
 |---|---|
 | **Self-Evolution** | Iteratively searches over harness changes and keeps the full evaluation history in one workspace. |
-| **7 Agent Backends** | Claude Code · Claw Code · Codex · OpenCode · API direct · OpenAI-compatible · Local — plug in any CLI agent. |
+| **8 Agent Backends** | Claude Code · Claw Code · Codex · Hermes · OpenCode · API direct · OpenAI-compatible · Local — plug in any CLI agent. |
 | **Full History** | Every iteration's code, scores, and traces preserved. The Meta-Harness paper reports that non-Markovian search outperforms blind retries. |
 | **Search Tree** | Visualize the optimization path. Compare any two candidates with per-task diffs. |
 | **One-Command Setup** | `ph init --base-harness ... --task-dir ...` — copies files, configures workspace, done. |
@@ -86,13 +86,19 @@ PolyHarness fills that gap. It's the open-source engine that makes Meta-Harness
 > - Memory tools (like Supermemory) give agents persistent **memory** across conversations.
 > - **PolyHarness gives agents persistent self-evolution** — you get a repeatable way to refine how they work over time.
+### Part of a wave — specialized for harnesses
+PolyHarness doesn't stand alone. A wave of open-source projects has shown that pairing LLMs with evolutionary search systematically improves code and prompts: [GEPA](https://github.com/gepa-ai/gepa) (reflective prompt evolution over a Pareto frontier), [ShinkaEvolve](https://github.com/SakanaAI/ShinkaEvolve) (sample-efficient program evolution), [OpenEvolve](https://github.com/algorithmicsuperintelligence/openevolve) (an open AlphaEvolve), and the [Darwin Gödel Machine](https://sakana.ai/dgm/) (open-ended self-improving agents).
+Most of these evolve *general* programs or algorithms. PolyHarness is the member of this wave **specialized for agent harnesses** — the prompts, tool config, and orchestration *around* an existing agent — with a focus on **online evolution from real usage** (`ph wrap` → `ph evolve`). It borrows the strongest ideas from these projects and applies them to any CLI agent on your own tasks: Pareto-frontier parent selection (GEPA), code-novelty rejection and an adaptive backend ensemble (ShinkaEvolve), and cascade evaluation (AlphaEvolve/OpenEvolve).
 ## What PolyHarness Is
 PolyHarness is the open-source engine for iteratively searching over an agent's harness.
 It builds on ideas from the Meta-Harness paper and the TBench2 results reported there, while focusing this repository on the optimization workflow itself — how harness variants are proposed, evaluated, and revised over repeated runs.
-If tools like ForgeCode help you code, PolyHarness helps you search for task-specific harness improvements by iterating on prompts, tool use, and harness logic.
+If tools like [ForgeCode](https://github.com/antinomyhq/forgecode) help you code, PolyHarness helps you search for task-specific harness improvements by iterating on prompts, tool use, and harness logic.
 ---
@@ -262,7 +268,7 @@ PolyHarness automatically sandboxes your agent inside this workspace, ensuring i
 | Scenario | How to configure |
 |----------|------------------|
-| **Supported CLI Tools** | Run `ph init --agent <name>`. PolyHarness auto-injects required instructions (e.g., `CLAUDE.md`).<br>*(Supported: claude-code, claw-code, codex, opencode)* |
+| **Supported CLI Tools** | Run `ph init --agent <name>`. PolyHarness auto-injects required instructions (e.g., `CLAUDE.md`).<br>*(Supported: claude-code, claw-code, codex, hermes, opencode)* |
 | **Anthropic API** | Run `ph init --agent api`. Set `export ANTHROPIC_API_KEY="sk-ant-..."` before `ph run`. |
 | **OpenAI / Local Models** | Run `ph init --agent openai`. Then configure the endpoint — see [Local Model Setup](#local-model-setup) below. |
 | **Custom CLI path** | If your CLI agent uses a non-standard command, edit `config.yaml` in the workspace before running:<br>`proposer: { cli_path: "npx @anthropic-ai/claude-code" }`|
@@ -275,6 +281,34 @@ ph run
 The orchestrator: copies your harness → asks the Proposer agent for a candidate change → evaluates the result → stores everything → repeats.
+```
+┌──────────────────────────────────────────────────────────────┐
+│                                                              │
+│   You                          PolyHarness                   │
+│    │                              │                          │
+│    ├── ph init ──────────────────→│ Creates workspace        │
+│    │   (harness + tasks + eval)   │ Copies files             │
+│    │                              │ Injects CLAUDE.md        │
+│    │                              │                          │
+│    ├── ph run ───────────────────→│ Starts search loop:      │
+│    │                              │                          │
+│    │   ┌──────────────────────────┤                          │
+│    │   │  Step 1: SELECT parent   │ Best or Tournament       │
+│    │   │  Step 2: COPY harness    │ From parent → candidate  │
+│    │   │  Step 3: PROPOSE changes │ Agent reads all history  │
+│    │   │  Step 4: EVALUATE        │ Run tasks, get scores    │
+│    │   │  Step 5: STORE results   │ Code + scores + traces   │
+│    │   │  Step 6: CHECK stopping  │ Improved? Patience left? │
+│    │   └──────────┬───────────────┤                          │
+│    │              └── loop ───────┘                          │
+│    │                              │                          │
+│    ├── ph log ───────────────────→│ Shows search tree        │
+│    ├── ph compare 0 5  ──────────→│ Score deltas + code diff │
+│    └── ph apply ─────────────────→│ Writes best back         │
+│                                                              │
+└──────────────────────────────────────────────────────────────┘
+```
 ### 5. Inspect and apply
 ```bash
@@ -303,6 +337,7 @@ Just add `ph wrap --auto-evolve` in front of your agent command (pick the one ma
 ph wrap --auto-evolve claude -p "Refactor the auth module to use JWT"   # Claude Code
 ph wrap --auto-evolve claw -p "Write integration tests for payments"     # Claw Code
 ph wrap --auto-evolve codex "Add retry logic to the API client"          # Codex
+ph wrap --auto-evolve hermes chat -q "Refactor the DB connection pool"   # Hermes Agent
 ph wrap --auto-evolve opencode -p "Fix the flaky parser test"            # OpenCode
 # Local models — wrap the CLI command directly
@@ -358,7 +393,67 @@ ph evolve                      # trigger evolution manually
 > **Tip:** Use `--no-record-output` if you don't want stdout/stderr saved (e.g., for sensitive output). Metadata is always recorded.
-> **Tip:** Create a shell alias for even less typing: `alias cc="ph wrap --auto-evolve claude"`
+#### Zero-config auto-wrap: `ph shell-hook`
+Don't want to type `ph wrap --auto-evolve` every time? Install a shell hook — it auto-intercepts agent commands:
+```bash
+ph shell-hook install          # one-time setup, writes to ~/.zshrc
+```
+After that, just use your agent as usual:
+```bash
+claude -p "Refactor auth to JWT"        # automatically becomes: ph wrap --auto-evolve claude -p ...
+claw -p "Write payment tests"            # same — auto-wrapped
+codex "Add retry logic"                  # same
+hermes chat -q "Refactor pool"           # same
+opencode -p "Fix flaky test"             # same
+```
+How it works: a `preexec` hook in your shell detects `claude`/`claw`/`codex`/`hermes`/`opencode` commands and transparently redirects them through `ph wrap --auto-evolve`. Your output is unchanged.
+```bash
+ph shell-hook status           # check if installed
+ph shell-hook uninstall        # remove cleanly (restores original rc file)
+```
+#### Auto-Evolution flow
+```
+┌──────────────────────────────────────────────────────────────┐
+│                                                              │
+│  You                            PolyHarness                  │
+│   │                               │                          │
+│   ├── ph shell-hook install ────→ │ Injects preexec hook     │
+│   │   (one-time setup)            │ into ~/.zshrc            │
+│   │                               │                          │
+│   ├── claude -p "Fix bug" ──────→ │ Shell hook intercepts    │
+│   │   (normal usage)              │                          │
+│   │                               ├── Run agent              │
+│   │   ┌─ output passes through  ──┤                          │
+│   │   │                           ├── Record trace           │
+│   │   │                           │   (~/.polyharness/       │
+│   │   │                           │    traces/)              │
+│   │   │                           │                          │
+│   │   │                           ├── Check threshold        │
+│   │   │                           │   traces < 50?           │
+│   │   │                           │   ├─ Yes: "7/50 traces"  │
+│   │   │                           │   └─ No: trigger ───┐    │
+│   │   │                           │                     │    │
+│   │   │                           │   ┌─────────────────┘    │
+│   │   │                           │   │ Evolution cycle      │
+│   │   │                           │   │ (same as ph run)     │
+│   │   │                           │   │ Propose → Evaluate   │
+│   │   │                           │   │ → Store → Repeat     │
+│   │   │                           │   └──────────────────    │
+│   │   │                           │                          │
+│   └───┘                           │                          │
+│                                                              │
+└──────────────────────────────────────────────────────────────┘
+```
+The key difference: **you never run `ph run` manually.** You use your agent as always; PolyHarness silently collects data and triggers evolution when it has enough signal.
 ### Try it now (no API key needed)
@@ -380,35 +475,7 @@ The score path above is the current measured result of the bundled `math-word-pr
 ## How It Works
-PolyHarness runs a **Meta-Harness-style search loop** — an iterative process where an AI agent proposes, evaluates, and stores harness changes:
-```
-┌──────────────────────────────────────────────────────────────┐
-│                                                              │
-│   You                          PolyHarness                   │
-│    │                              │                          │
-│    ├── ph init ──────────────────→│ Creates workspace        │
-│    │   (harness + tasks + eval)   │ Copies files             │
-│    │                              │ Injects CLAUDE.md        │
-│    │                              │                          │
-│    ├── ph run ───────────────────→│ Starts search loop:      │
-│    │                              │                          │
-│    │   ┌──────────────────────────┤                          │
-│    │   │  Step 1: SELECT parent   │ Best or Tournament       │
-│    │   │  Step 2: COPY harness    │ From parent → candidate  │
-│    │   │  Step 3: PROPOSE changes │ Agent reads all history  │
-│    │   │  Step 4: EVALUATE        │ Run tasks, get scores    │
-│    │   │  Step 5: STORE results   │ Code + scores + traces   │
-│    │   │  Step 6: CHECK stopping  │ Improved? Patience left? │
-│    │   └──────────┬───────────────┤                          │
-│    │              └── loop ───────┘                          │
-│    │                              │                          │
-│    ├── ph log ───────────────────→│ Shows search tree        │
-│    ├── ph compare 0 5  ──────────→│ Score deltas + code diff │
-│    └── ph apply ─────────────────→│ Writes best back         │
-│                                                              │
-└──────────────────────────────────────────────────────────────┘
-```
+PolyHarness runs a **Meta-Harness-style search loop** — an iterative process where an AI agent proposes, evaluates, and stores harness changes. See the detailed flow diagrams above in [Step 4](#4-run-the-optimization-loop) and [Step 6](#6-auto-evolution).
 ### Why it works: non-Markovian search
@@ -433,12 +500,23 @@ The Proposer reads **all of this** before generating the next candidate. It can
 | `claude-code` | `claude -p` | Official Claude Code CLI (Pro/Teams subscription) |
 | `claw-code` | `claw -p` | Open-source Claw Code CLI |
 | `codex` | `codex --quiet` | OpenAI Codex CLI |
+| `hermes` | `hermes chat -q` | Nous Research [Hermes Agent](https://github.com/NousResearch/hermes-agent) CLI |
 | `opencode` | `opencode -p` | OpenCode CLI |
 | `local` | — | Offline rule-based engine for development & testing |
 `ph doctor` auto-detects all available backends and shows their status.
-When you run `ph init --agent claude-code`, PolyHarness automatically generates a `CLAUDE.md` instruction file in the workspace, telling the agent how to behave as an optimization Proposer. Same for `CLAW.md`, `CODEX.md`, `OPENCODE.md` — each agent's native instruction format.
+When you run `ph init --agent claude-code`, PolyHarness automatically generates a `CLAUDE.md` instruction file in the workspace, telling the agent how to behave as an optimization Proposer. Same for `CLAW.md`, `CODEX.md`, `AGENTS.md` (Hermes), `OPENCODE.md` — each agent's native instruction format.
+#### Backend ensemble (adaptive selection)
+Don't know which backend writes the best harness changes for your task? Let PolyHarness find out. Pass several and it picks one per iteration with a **UCB bandit**, shifting picks toward whichever backend actually produces *improving* candidates:
+```bash
+ph run --ensemble "claude-code,codex,local"
+```
+At the end of the run you get a per-backend breakdown (picks + improve-rate). Selection is deterministic given the reward sequence, so runs stay reproducible. Inspired by ShinkaEvolve's adaptive LLM-ensemble selection.
 ### Local Model Setup
@@ -488,10 +566,16 @@ After `ph init`, the workspace has a `config.yaml` with these sections:
 search:
   max_iterations: 20          # Maximum search iterations
   early_stop_patience: 5      # Stop after N iterations with no improvement
-  parent_selection: best       # Strategy: best | tournament | all
+  parent_selection: best       # Strategy: best | tournament | all | pareto
+  novelty_filter: false        # Reject near-duplicate candidates before eval (saves budget)
+  novelty_threshold: 0.97      # Similarity ratio above which a candidate is a near-duplicate
+  novelty_max_retries: 1       # Regenerate a near-duplicate this many times before skipping
+  seed: null                   # RNG seed — set an int to make randomized runs reproducible
 proposer:
-  backend: api                 # api | openai | claude-code | claw-code | codex | opencode | local
+  backend: api                 # api | openai | claude-code | claw-code | codex | hermes | opencode | local
+  ensemble: []                 # If non-empty, pick among these backends per iteration via a UCB bandit
+  bandit_c: 1.41421356         # UCB exploration constant (higher = more exploration)
   model: claude-sonnet-4-20250514  # Model name (for api/openai backends)
   base_url: null               # Custom API endpoint (for openai backend)
   api_key: null                # API key override (null = use env var)
@@ -503,6 +587,9 @@ evaluator:
   type: python                 # python | docker | custom
   entry: evaluate.py           # Evaluator script entrypoint
   timeout: 300                 # Per-task timeout in seconds
+  cascade: false               # Stage cheap subset first; skip rest if it fails the gate (per-task mode)
+  cascade_threshold: 0.4       # Min stage-1 mean score required to run the full task set
+  cascade_stage1: 0            # Tasks in stage 1 (0 = auto, ~1/3 of the list)
 harness:
   language: python             # Harness code language
@@ -570,11 +657,11 @@ python -m polyharness --version
 | `ph init` | Initialize workspace with auto-copy of harness, tasks, eval script |
 | `ph run` | Start the optimization search loop |
 | `ph status` | Progress table with elapsed time, improvement rate, and delta |
-| `ph log` | Search tree with delta (Δ) column (or `--flat` for table) |
+| `ph log` | Search tree with delta (Δ) column and Pareto-frontier (◆) markers (or `--flat` for table) |
 | `ph best` | Show best candidate: score, per-task breakdown, changes summary |
 | `ph compare A B` | Compare two iterations: score deltas + unified code diff |
 | `ph diff <N>` | Shorthand for `compare 0 <N>` |
-| `ph leaderboard` | Ranked table of all candidates (`--top N`, `--tasks` drilldown) |
+| `ph leaderboard` | Ranked table of all candidates with Pareto (◆) and backend columns (`--top N`, `--tasks` drilldown) |
 | `ph trace <N>` | View stdout, stderr, metrics, exit code for an iteration |
 | `ph report` | Generate a full markdown report with score trends and per-task table |
 | `ph apply` | Copy best harness back to `base_harness/` (or `--target` dir) |
@@ -588,6 +675,9 @@ python -m polyharness --version
 | `ph traces stats` | Summary statistics: total traces, scored count, agent distribution |
 | `ph traces clear` | Remove collected traces (`--keep N` to retain newest, `-y` to skip confirm) |
 | `ph evolve` | Trigger an online evolution cycle using collected traces as context |
+| `ph shell-hook install` | Install shell hook to auto-wrap agent commands (claude, claw, codex, opencode) |
+| `ph shell-hook uninstall` | Remove the shell hook from your rc file |
+| `ph shell-hook status` | Check if the shell hook is installed |
 | `ph upgrade` | Upgrade PolyHarness to the latest version |
 | `ph uninstall` | Uninstall PolyHarness from the current environment (`-y` to skip confirm) |
@@ -615,7 +705,8 @@ python -m polyharness --version
 --dry-run            Only evaluate the base harness, skip search
 --resume             Continue an interrupted search from where it left off
 --backend <name>     Override proposer backend without editing config
---strategy <name>    Override parent selection: best | tournament | all
+--strategy <name>    Override parent selection: best | tournament | all | pareto
+--ensemble b1,b2,... Pick among multiple backends per iteration via a UCB bandit
 ```
 ### `ph wrap` options
@@ -697,7 +788,7 @@ ph run --max-iterations 5
 ```
 polyharness/
 ├── src/polyharness/
-│   ├── cli.py                   # Click CLI — 22 commands/subcommands
+│   ├── cli.py                   # Click CLI — 25 commands/subcommands
 │   ├── config.py                # Pydantic config models (+ EvolutionConfig)
 │   ├── collector.py             # Trace collector for online evolution
 │   ├── orchestrator.py          # Meta-Harness search loop + progress bar + error recovery
@@ -715,6 +806,7 @@ polyharness/
 │   │       ├── claude_code.py   # claude -p
 │   │       ├── claw_code.py     # claw -p
 │   │       ├── codex.py         # codex --quiet --auto-edit
+│   │       ├── hermes.py        # hermes chat -q
 │   │       └── opencode.py      # opencode -p
 │   └── templates/               # 5 built-in task templates
 │       ├── text-classification/
@@ -722,7 +814,7 @@ polyharness/
 │       ├── code-generation/
 │       ├── rag-qa/
 │       └── api-calling/
-├── tests/                       # 165 tests (pytest)
+├── tests/                       # 173 tests (pytest)
 ├── bin/                         # npm wrapper (ph.mjs, postinstall.mjs)
 ├── docs/
 │   ├── development/             # Product roadmap & technical architecture
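
Reviewer note on the "Backend ensemble" hunk above: the added README text says that one backend is picked per iteration with a UCB bandit and that selection is deterministic given the reward sequence. The sketch below shows how standard UCB1 produces that behavior. It is an illustration under stated assumptions, not PolyHarness's actual code: the function names `ucb_pick` and `simulate`, the placeholder backend names, and the 0/1 "improved" reward are all hypothetical. Only the default exploration constant (the square root of 2, matching `bandit_c: 1.41421356`) is taken from the diff.

```python
import math


def ucb_pick(counts, rewards, c=math.sqrt(2)):
    """Return the index of the arm UCB1 would try next.

    counts[i]  -- how many times arm i has been picked so far
    rewards[i] -- summed reward for arm i (here: number of improving candidates)
    """
    # Try every arm once, in list order, before using the formula.
    for i, n in enumerate(counts):
        if n == 0:
            return i
    total = sum(counts)
    scores = [
        rewards[i] / counts[i] + c * math.sqrt(math.log(total) / counts[i])
        for i in range(len(counts))
    ]
    # max() returns the first maximum, so ties resolve to the lowest index
    # and no random number generator is involved.
    return max(range(len(scores)), key=scores.__getitem__)


def simulate(backends, reward_fn, iterations, c=math.sqrt(2)):
    """Run the bandit against a hypothetical reward function."""
    counts = [0] * len(backends)
    rewards = [0.0] * len(backends)
    picks = []
    for _ in range(iterations):
        i = ucb_pick(counts, rewards, c)
        counts[i] += 1
        rewards[i] += reward_fn(backends[i])
        picks.append(i)
    return picks, counts


if __name__ == "__main__":
    # Hypothetical setup: only "backend-b" ever yields an improving candidate.
    names = ["backend-a", "backend-b", "backend-c"]
    picks, counts = simulate(names, lambda n: 1.0 if n == "backend-b" else 0.0, 30)
    for name, n in zip(names, counts):
        print(f"{name}: picked {n} times")
```

Because the pick depends only on the counts and summed rewards, replaying the same reward sequence reproduces the same picks, which is consistent with the reproducibility claim in the hunk.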

{polyharness-0.2.0 → polyharness-0.2.2}/README.md RENAMED

@@ -15,7 +15,7 @@
 [![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](LICENSE)
 [![Python 3.12+](https://img.shields.io/badge/python-3.12+-blue.svg)](https://www.python.org/downloads/)
-[![Tests](https://img.shields.io/badge/tests-165%20passing-brightgreen.svg)]()
+[![Tests](https://img.shields.io/badge/tests-206%20passing-brightgreen.svg)]()
 [![中文文档](https://img.shields.io/badge/文档-中文版-red.svg)](README_CN.md)
 ---
@@ -30,7 +30,7 @@ Your AI agent runs the same harness every time. Same prompts, same tool config,
 | | |
 |---|---|
 | **Self-Evolution** | Iteratively searches over harness changes and keeps the full evaluation history in one workspace. |
-| **7 Agent Backends** | Claude Code · Claw Code · Codex · OpenCode · API direct · OpenAI-compatible · Local — plug in any CLI agent. |
+| **8 Agent Backends** | Claude Code · Claw Code · Codex · Hermes · OpenCode · API direct · OpenAI-compatible · Local — plug in any CLI agent. |
 | **Full History** | Every iteration's code, scores, and traces preserved. The Meta-Harness paper reports that non-Markovian search outperforms blind retries. |
 | **Search Tree** | Visualize the optimization path. Compare any two candidates with per-task diffs. |
 | **One-Command Setup** | `ph init --base-harness ... --task-dir ...` — copies files, configures workspace, done. |
@@ -53,13 +53,19 @@ PolyHarness fills that gap. It's the open-source engine that makes Meta-Harness
 > - Memory tools (like Supermemory) give agents persistent **memory** across conversations.
 > - **PolyHarness gives agents persistent self-evolution** — you get a repeatable way to refine how they work over time.
+### Part of a wave — specialized for harnesses
+PolyHarness doesn't stand alone. A wave of open-source projects has shown that pairing LLMs with evolutionary search systematically improves code and prompts: [GEPA](https://github.com/gepa-ai/gepa) (reflective prompt evolution over a Pareto frontier), [ShinkaEvolve](https://github.com/SakanaAI/ShinkaEvolve) (sample-efficient program evolution), [OpenEvolve](https://github.com/algorithmicsuperintelligence/openevolve) (an open AlphaEvolve), and the [Darwin Gödel Machine](https://sakana.ai/dgm/) (open-ended self-improving agents).
+Most of these evolve *general* programs or algorithms. PolyHarness is the member of this wave **specialized for agent harnesses** — the prompts, tool config, and orchestration *around* an existing agent — with a focus on **online evolution from real usage** (`ph wrap` → `ph evolve`). It borrows the strongest ideas from these projects and applies them to any CLI agent on your own tasks: Pareto-frontier parent selection (GEPA), code-novelty rejection and an adaptive backend ensemble (ShinkaEvolve), and cascade evaluation (AlphaEvolve/OpenEvolve).
 ## What PolyHarness Is
 PolyHarness is the open-source engine for iteratively searching over an agent's harness.
 It builds on ideas from the Meta-Harness paper and the TBench2 results reported there, while focusing this repository on the optimization workflow itself — how harness variants are proposed, evaluated, and revised over repeated runs.
-If tools like ForgeCode help you code, PolyHarness helps you search for task-specific harness improvements by iterating on prompts, tool use, and harness logic.
+If tools like [ForgeCode](https://github.com/antinomyhq/forgecode) help you code, PolyHarness helps you search for task-specific harness improvements by iterating on prompts, tool use, and harness logic.
 ---
@@ -229,7 +235,7 @@ PolyHarness automatically sandboxes your agent inside this workspace, ensuring i
 | Scenario | How to configure |
 |----------|------------------|
-| **Supported CLI Tools** | Run `ph init --agent <name>`. PolyHarness auto-injects required instructions (e.g., `CLAUDE.md`).<br>*(Supported: claude-code, claw-code, codex, opencode)* |
+| **Supported CLI Tools** | Run `ph init --agent <name>`. PolyHarness auto-injects required instructions (e.g., `CLAUDE.md`).<br>*(Supported: claude-code, claw-code, codex, hermes, opencode)* |
 | **Anthropic API** | Run `ph init --agent api`. Set `export ANTHROPIC_API_KEY="sk-ant-..."` before `ph run`. |
 | **OpenAI / Local Models** | Run `ph init --agent openai`. Then configure the endpoint — see [Local Model Setup](#local-model-setup) below. |
 | **Custom CLI path** | If your CLI agent uses a non-standard command, edit `config.yaml` in the workspace before running:<br>`proposer: { cli_path: "npx @anthropic-ai/claude-code" }`|
@@ -242,6 +248,34 @@ ph run
 The orchestrator: copies your harness → asks the Proposer agent for a candidate change → evaluates the result → stores everything → repeats.
+```
+┌──────────────────────────────────────────────────────────────┐
+│                                                              │
+│   You                          PolyHarness                   │
+│    │                              │                          │
+│    ├── ph init ──────────────────→│ Creates workspace        │
+│    │   (harness + tasks + eval)   │ Copies files             │
+│    │                              │ Injects CLAUDE.md        │
+│    │                              │                          │
+│    ├── ph run ───────────────────→│ Starts search loop:      │
+│    │                              │                          │
+│    │   ┌──────────────────────────┤                          │
+│    │   │  Step 1: SELECT parent   │ Best or Tournament       │
+│    │   │  Step 2: COPY harness    │ From parent → candidate  │
+│    │   │  Step 3: PROPOSE changes │ Agent reads all history  │
+│    │   │  Step 4: EVALUATE        │ Run tasks, get scores    │
+│    │   │  Step 5: STORE results   │ Code + scores + traces   │
+│    │   │  Step 6: CHECK stopping  │ Improved? Patience left? │
+│    │   └──────────┬───────────────┤                          │
+│    │              └── loop ───────┘                          │
+│    │                              │                          │
+│    ├── ph log ───────────────────→│ Shows search tree        │
+│    ├── ph compare 0 5  ──────────→│ Score deltas + code diff │
+│    └── ph apply ─────────────────→│ Writes best back         │
+│                                                              │
+└──────────────────────────────────────────────────────────────┘
+```
 ### 5. Inspect and apply
 ```bash
@@ -270,6 +304,7 @@ Just add `ph wrap --auto-evolve` in front of your agent command (pick the one ma
 ph wrap --auto-evolve claude -p "Refactor the auth module to use JWT"   # Claude Code
 ph wrap --auto-evolve claw -p "Write integration tests for payments"     # Claw Code
 ph wrap --auto-evolve codex "Add retry logic to the API client"          # Codex
+ph wrap --auto-evolve hermes chat -q "Refactor the DB connection pool"   # Hermes Agent
 ph wrap --auto-evolve opencode -p "Fix the flaky parser test"            # OpenCode
 # Local models — wrap the CLI command directly
@@ -325,7 +360,67 @@ ph evolve                      # trigger evolution manually
 > **Tip:** Use `--no-record-output` if you don't want stdout/stderr saved (e.g., for sensitive output). Metadata is always recorded.
-> **Tip:** Create a shell alias for even less typing: `alias cc="ph wrap --auto-evolve claude"`
+#### Zero-config auto-wrap: `ph shell-hook`
+Don't want to type `ph wrap --auto-evolve` every time? Install a shell hook — it auto-intercepts agent commands:
+```bash
+ph shell-hook install          # one-time setup, writes to ~/.zshrc
+```
+After that, just use your agent as usual:
+```bash
+claude -p "Refactor auth to JWT"        # automatically becomes: ph wrap --auto-evolve claude -p ...
+claw -p "Write payment tests"            # same — auto-wrapped
+codex "Add retry logic"                  # same
+hermes chat -q "Refactor pool"           # same
+opencode -p "Fix flaky test"             # same
+```
+How it works: a `preexec` hook in your shell detects `claude`/`claw`/`codex`/`hermes`/`opencode` commands and transparently redirects them through `ph wrap --auto-evolve`. Your output is unchanged.
+```bash
+ph shell-hook status           # check if installed
+ph shell-hook uninstall        # remove cleanly (restores original rc file)
+```
+#### Auto-Evolution flow
+```
+┌──────────────────────────────────────────────────────────────┐
+│                                                              │
+│  You                            PolyHarness                  │
+│   │                               │                          │
+│   ├── ph shell-hook install ────→ │ Injects preexec hook     │
+│   │   (one-time setup)            │ into ~/.zshrc            │
+│   │                               │                          │
+│   ├── claude -p "Fix bug" ──────→ │ Shell hook intercepts    │
+│   │   (normal usage)              │                          │
+│   │                               ├── Run agent              │
+│   │   ┌─ output passes through  ──┤                          │
+│   │   │                           ├── Record trace           │
+│   │   │                           │   (~/.polyharness/       │
+│   │   │                           │    traces/)              │
+│   │   │                           │                          │
+│   │   │                           ├── Check threshold        │
+│   │   │                           │   traces < 50?           │
+│   │   │                           │   ├─ Yes: "7/50 traces"  │
+│   │   │                           │   └─ No: trigger ───┐    │
+│   │   │                           │                     │    │
+│   │   │                           │   ┌─────────────────┘    │
+│   │   │                           │   │ Evolution cycle      │
+│   │   │                           │   │ (same as ph run)     │
+│   │   │                           │   │ Propose → Evaluate   │
+│   │   │                           │   │ → Store → Repeat     │
+│   │   │                           │   └──────────────────    │
+│   │   │                           │                          │
+│   └───┘                           │                          │
+│                                                              │
+└──────────────────────────────────────────────────────────────┘
+```
+The key difference: **you never run `ph run` manually.** You use your agent as always; PolyHarness silently collects data and triggers evolution when it has enough signal.
 ### Try it now (no API key needed)
@@ -347,35 +442,7 @@ The score path above is the current measured result of the bundled `math-word-pr
 ## How It Works
-PolyHarness runs a **Meta-Harness-style search loop** — an iterative process where an AI agent proposes, evaluates, and stores harness changes:
-```
-┌──────────────────────────────────────────────────────────────┐
-│                                                              │
-│   You                          PolyHarness                   │
-│    │                              │                          │
-│    ├── ph init ──────────────────→│ Creates workspace        │
-│    │   (harness + tasks + eval)   │ Copies files             │
-│    │                              │ Injects CLAUDE.md        │
-│    │                              │                          │
-│    ├── ph run ───────────────────→│ Starts search loop:      │
-│    │                              │                          │
-│    │   ┌──────────────────────────┤                          │
-│    │   │  Step 1: SELECT parent   │ Best or Tournament       │
-│    │   │  Step 2: COPY harness    │ From parent → candidate  │
-│    │   │  Step 3: PROPOSE changes │ Agent reads all history  │
-│    │   │  Step 4: EVALUATE        │ Run tasks, get scores    │
-│    │   │  Step 5: STORE results   │ Code + scores + traces   │
-│    │   │  Step 6: CHECK stopping  │ Improved? Patience left? │
-│    │   └──────────┬───────────────┤                          │
-│    │              └── loop ───────┘                          │
-│    │                              │                          │
-│    ├── ph log ───────────────────→│ Shows search tree        │
-│    ├── ph compare 0 5  ──────────→│ Score deltas + code diff │
-│    └── ph apply ─────────────────→│ Writes best back         │
-│                                                              │
-└──────────────────────────────────────────────────────────────┘
-```
+PolyHarness runs a **Meta-Harness-style search loop** — an iterative process where an AI agent proposes, evaluates, and stores harness changes. See the detailed flow diagrams above in [Step 4](#4-run-the-optimization-loop) and [Step 6](#6-auto-evolution).
 ### Why it works: non-Markovian search
@@ -400,12 +467,23 @@ The Proposer reads **all of this** before generating the next candidate. It can
 | `claude-code` | `claude -p` | Official Claude Code CLI (Pro/Teams subscription) |
 | `claw-code` | `claw -p` | Open-source Claw Code CLI |
 | `codex` | `codex --quiet` | OpenAI Codex CLI |
+| `hermes` | `hermes chat -q` | Nous Research [Hermes Agent](https://github.com/NousResearch/hermes-agent) CLI |
 | `opencode` | `opencode -p` | OpenCode CLI |
 | `local` | — | Offline rule-based engine for development & testing |
 `ph doctor` auto-detects all available backends and shows their status.
-When you run `ph init --agent claude-code`, PolyHarness automatically generates a `CLAUDE.md` instruction file in the workspace, telling the agent how to behave as an optimization Proposer. Same for `CLAW.md`, `CODEX.md`, `OPENCODE.md` — each agent's native instruction format.
+When you run `ph init --agent claude-code`, PolyHarness automatically generates a `CLAUDE.md` instruction file in the workspace, telling the agent how to behave as an optimization Proposer. Same for `CLAW.md`, `CODEX.md`, `AGENTS.md` (Hermes), `OPENCODE.md` — each agent's native instruction format.
+#### Backend ensemble (adaptive selection)
+Don't know which backend writes the best harness changes for your task? Let PolyHarness find out. Pass several and it picks one per iteration with a **UCB bandit**, shifting picks toward whichever backend actually produces *improving* candidates:
+```bash
+ph run --ensemble "claude-code,codex,local"
+```
+At the end of the run you get a per-backend breakdown (picks + improve-rate). Selection is deterministic given the reward sequence, so runs stay reproducible. Inspired by ShinkaEvolve's adaptive LLM-ensemble selection.
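
A minimal sketch of UCB1-style selection, as one way a per-iteration backend picker like the one described above could work. It is not PolyHarness's actual implementation: the name `pick_backend`, the layout of `stats`, and the 0/1 "candidate improved" reward are assumptions for illustration. Only the default exploration constant mirrors the documented `bandit_c` value.

```python
import math


def pick_backend(stats: dict[str, tuple[int, int]], c: float = 1.41421356) -> str:
    """Return the backend with the highest UCB1 score.

    stats maps backend name -> (picks, improvements). Hypothetical layout.
    """
    # Try every backend once before comparing scores (insertion order).
    for name, (picks, _) in stats.items():
        if picks == 0:
            return name

    total = sum(picks for picks, _ in stats.values())

    def ucb(name: str) -> float:
        picks, improvements = stats[name]
        mean_reward = improvements / picks               # observed improve-rate
        bonus = c * math.sqrt(math.log(total) / picks)   # exploration term
        return mean_reward + bonus

    # max() returns the first of tied entries, so the choice is deterministic.
    return max(stats, key=ucb)


stats = {"claude-code": (5, 3), "codex": (5, 1), "local": (5, 0)}
print(pick_backend(stats))
```

With equal pick counts the exploration bonus is the same for every backend, so the one with the best improve-rate wins. A rarely picked backend gets a larger bonus and is revisited even if its rate so far is low.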
 ### Local Model Setup
@@ -455,10 +533,16 @@ After `ph init`, the workspace has a `config.yaml` with these sections:
 search:
   max_iterations: 20          # Maximum search iterations
   early_stop_patience: 5      # Stop after N iterations with no improvement
-  parent_selection: best       # Strategy: best | tournament | all
+  parent_selection: best       # Strategy: best | tournament | all | pareto
+  novelty_filter: false        # Reject near-duplicate candidates before eval (saves budget)
+  novelty_threshold: 0.97      # Similarity ratio above which a candidate is a near-duplicate
+  novelty_max_retries: 1       # Regenerate a near-duplicate this many times before skipping
+  seed: null                   # RNG seed — set an int to make randomized runs reproducible
 proposer:
-  backend: api                 # api | openai | claude-code | claw-code | codex | opencode | local
+  backend: api                 # api | openai | claude-code | claw-code | codex | hermes | opencode | local
+  ensemble: []                 # If non-empty, pick among these backends per iteration via a UCB bandit
+  bandit_c: 1.41421356         # UCB exploration constant (higher = more exploration)
   model: claude-sonnet-4-20250514  # Model name (for api/openai backends)
   base_url: null               # Custom API endpoint (for openai backend)
   api_key: null                # API key override (null = use env var)
@@ -470,6 +554,9 @@ evaluator:
   type: python                 # python | docker | custom
   entry: evaluate.py           # Evaluator script entrypoint
   timeout: 300                 # Per-task timeout in seconds
+  cascade: false               # Stage cheap subset first; skip rest if it fails the gate (per-task mode)
+  cascade_threshold: 0.4       # Min stage-1 mean score required to run the full task set
+  cascade_stage1: 0            # Tasks in stage 1 (0 = auto, ~1/3 of the list)
 harness:
   language: python             # Harness code language
@@ -537,11 +624,11 @@ python -m polyharness --version
 | `ph init` | Initialize workspace with auto-copy of harness, tasks, eval script |
 | `ph run` | Start the optimization search loop |
 | `ph status` | Progress table with elapsed time, improvement rate, and delta |
-| `ph log` | Search tree with delta (Δ) column (or `--flat` for table) |
+| `ph log` | Search tree with delta (Δ) column and Pareto-frontier (◆) markers (or `--flat` for table) |
 | `ph best` | Show best candidate: score, per-task breakdown, changes summary |
 | `ph compare A B` | Compare two iterations: score deltas + unified code diff |
 | `ph diff <N>` | Shorthand for `compare 0 <N>` |
-| `ph leaderboard` | Ranked table of all candidates (`--top N`, `--tasks` drilldown) |
+| `ph leaderboard` | Ranked table of all candidates with Pareto (◆) and backend columns (`--top N`, `--tasks` drilldown) |
 | `ph trace <N>` | View stdout, stderr, metrics, exit code for an iteration |
 | `ph report` | Generate a full markdown report with score trends and per-task table |
 | `ph apply` | Copy best harness back to `base_harness/` (or `--target` dir) |
@@ -555,6 +642,9 @@ python -m polyharness --version
 | `ph traces stats` | Summary statistics: total traces, scored count, agent distribution |
 | `ph traces clear` | Remove collected traces (`--keep N` to retain newest, `-y` to skip confirm) |
 | `ph evolve` | Trigger an online evolution cycle using collected traces as context |
+| `ph shell-hook install` | Install shell hook to auto-wrap agent commands (claude, claw, codex, opencode) |
+| `ph shell-hook uninstall` | Remove the shell hook from your rc file |
+| `ph shell-hook status` | Check if the shell hook is installed |
 | `ph upgrade` | Upgrade PolyHarness to the latest version |
 | `ph uninstall` | Uninstall PolyHarness from the current environment (`-y` to skip confirm) |
@@ -582,7 +672,8 @@ python -m polyharness --version
 --dry-run            Only evaluate the base harness, skip search
 --resume             Continue an interrupted search from where it left off
 --backend <name>     Override proposer backend without editing config
---strategy <name>    Override parent selection: best | tournament | all
+--strategy <name>    Override parent selection: best | tournament | all | pareto
+--ensemble b1,b2,... Pick among multiple backends per iteration via a UCB bandit
 ```
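
The `pareto` strategy is listed above without a definition. As a hedged illustration of the general idea only, the sketch below computes a Pareto frontier over candidates scored on several objectives, where higher is better. Treating per-task scores as the objectives, and the names `dominates` and `pareto_front`, are assumptions. The source does not say how PolyHarness defines its frontier.

```python
def dominates(a: tuple[float, ...], b: tuple[float, ...]) -> bool:
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(scores: dict[int, tuple[float, ...]]) -> list[int]:
    """Return the ids of candidates that no other candidate dominates."""
    return [
        cid
        for cid, s in scores.items()
        if not any(dominates(other, s) for oid, other in scores.items() if oid != cid)
    ]


# Hypothetical per-task scores for four iterations (two tasks each).
scores = {0: (0.5, 0.5), 1: (0.9, 0.2), 2: (0.4, 0.4), 3: (0.2, 0.9)}
print(pareto_front(scores))
```

Here iteration 2 is dominated by iteration 0, while iterations 0, 1, and 3 each win on at least one trade-off, so any of them would be a reasonable parent under a frontier-based strategy.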
 ### `ph wrap` options
@@ -664,7 +755,7 @@ ph run --max-iterations 5
 ```
 polyharness/
 ├── src/polyharness/
-│   ├── cli.py                   # Click CLI — 22 commands/subcommands
+│   ├── cli.py                   # Click CLI — 25 commands/subcommands
 │   ├── config.py                # Pydantic config models (+ EvolutionConfig)
 │   ├── collector.py             # Trace collector for online evolution
 │   ├── orchestrator.py          # Meta-Harness search loop + progress bar + error recovery
@@ -682,6 +773,7 @@ polyharness/
 │   │       ├── claude_code.py   # claude -p
 │   │       ├── claw_code.py     # claw -p
 │   │       ├── codex.py         # codex --quiet --auto-edit
+│   │       ├── hermes.py        # hermes chat -q
 │   │       └── opencode.py      # opencode -p
 │   └── templates/               # 5 built-in task templates
 │       ├── text-classification/
@@ -689,7 +781,7 @@ polyharness/
 │       ├── code-generation/
 │       ├── rag-qa/
 │       └── api-calling/
-├── tests/                       # 165 tests (pytest)
+├── tests/                       # 173 tests (pytest)
 ├── bin/                         # npm wrapper (ph.mjs, postinstall.mjs)
 ├── docs/
 │   ├── development/             # Product roadmap & technical architecture

{polyharness-0.2.0 → polyharness-0.2.2}/pyproject.toml RENAMED

@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 [project]
 name = "polyharness"
-version = "0.2.0"
+version = "0.2.2"
 description = "Automated harness optimization for AI agents — make your agent evolve."
 readme = "README.md"
 license = "MIT"

{polyharness-0.2.0 → polyharness-0.2.2}/src/polyharness/__init__.py RENAMED

@@ -1,3 +1,3 @@
 """PolyHarness — Automated harness optimization for AI agents."""
-__version__ = "0.2.0"
+__version__ = "0.2.2"
