PyPI - simplicio-cli - Versions diffs - 0.2.12__tar.gz → 0.4.1__tar.gz - Mend

simplicio-cli 0.2.12__tar.gz → 0.4.1__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (36)

{simplicio_cli-0.2.12/simplicio_cli.egg-info → simplicio_cli-0.4.1}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: simplicio-cli
-Version: 0.2.12
+Version: 0.4.1
 Summary: Portable task-to-code pipeline that works with any LLM. Turn a one-line task into a verified code change — diff + test + verify loop. +55 pts on a 156-check benchmark, 21% faster, ~same tokens.
 Author-email: Wesley Simplicio <wesleybob4@gmail.com>
 License: MIT
@@ -31,6 +31,11 @@ Requires-Dist: sentence-transformers>=2.2
 Requires-Dist: numpy>=1.23
 Requires-Dist: anthropic>=0.30
 Requires-Dist: openai>=1.30
+Requires-Dist: simplicio-mapper>=0.6.0
+Requires-Dist: simplicio-prompt>=1.9.0
+Requires-Dist: httpx>=0.27
+Requires-Dist: orjson>=3.10
+Requires-Dist: diskcache>=5.6
 Provides-Extra: bench
 Requires-Dist: fpdf2>=2.7; extra == "bench"
 Dynamic: license-file
@@ -46,7 +51,9 @@ Dynamic: license-file
 [![simplicio-cli pipeline hero: one-line task to verified code change](https://raw.githubusercontent.com/wesleysimplicio/simplicio-cli/master/output/imagegen/simplicio-cli-readme-hero-web.png)](output/imagegen/simplicio-cli-readme-hero.png)
 > *"hide the Delete button for non-admins"* → diff + test + applied + verified.
-> Works with **OpenRouter, OpenAI, Anthropic, GLM, DeepSeek, Ollama** — one env var.
+> **Zero API key inside Claude Code** (auto-installs, uses your subscription) — or
+> bring your own key for any provider: OpenRouter, OpenAI, Anthropic, GLM,
+> DeepSeek, Ollama.
 ```bash
 pip install simplicio-cli
@@ -56,10 +63,125 @@ pip install simplicio-cli
 ## Why it works — the numbers
+Two complementary benchmarks measure different things. Read them in order.
+### 1. Execution benchmark — real project, real tasks, real test suite (the "does it work" answer)
+**This is not regex pattern-matching, and it is not a synthetic toy harness run
+in isolation.** The benchmark runs against [`wesleysimplicio/sistema-sindico`](https://github.com/wesleysimplicio/sistema-sindico),
+a real condominium-management system in pure PHP 8, public on GitHub, with a
+real PHPUnit suite (`vendor/bin/phpunit --configuration phpunit.xml.dist`).
+For each task the model is asked for a **real engineering change** — add a new
+method to an existing production class (permission helper, env parser,
+rate-limit key builder, repository SQL builder, route introspection, etc.).
+The generated file replaces the original in a working copy of the real repo;
+a **hidden PHPUnit test** (never shown to the model, asserting BOTH true and
+false states of the required behaviour) is dropped into
+`tests/unit/Core/Hidden/`; the **entire production suite runs** (every
+pre-existing test of the real codebase plus the hidden one). **Pass =
+`phpunit` exit code 0** — the same green/red signal the project's CI would use
+to merge a PR. The model's change must be *correct* (the new test passes) AND
+must *not break existing behaviour* (every prior test still passes).
+All sides emit the complete file (identical output shape); the only variable
+is the wrapping prompt.
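The pass criterion described above reduces to a single process exit code. A minimal Python sketch of that check, assuming a prepared working copy; `suite_passes` and its parameters are illustrative names, not code taken from `bench/run_exec_sindico.py`:

```python
import subprocess
import sys
from pathlib import Path

PHPUNIT_CMD = ("vendor/bin/phpunit", "--configuration", "phpunit.xml.dist")

def suite_passes(repo_dir, cmd=PHPUNIT_CMD):
    """Return True only when the whole suite exits with code 0."""
    result = subprocess.run(
        list(cmd), cwd=Path(repo_dir), capture_output=True, text=True
    )
    return result.returncode == 0

if __name__ == "__main__":
    # Stand-in command so the sketch runs without PHP installed.
    print(suite_passes(".", cmd=(sys.executable, "-c", "raise SystemExit(0)")))
```

Because the hidden test sits inside the same suite, one exit code covers both conditions: the new behaviour is correct and no pre-existing test regressed.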
+4 tasks × **9 models** (3 small · 3 mid · 3 frontier) = **36 runs per side** (72 runs across both sides), scored by `vendor/bin/phpunit` exit code on 2026-05-28:
+| Tier | Model | Without simplicio | With simplicio | Gain |
+|---|---|---|---|---|
+| small | **Llama 3.2 1B** (`meta-llama/Llama-3.2-1B-Instruct`) | 0% | 0% | 0 pts |
+| small | **Gemma 3n e4B** (`google/gemma-3n-E4B-it`) | 0% | 0% | 0 pts |
+| small | **Gemma 3 4B** (`google/gemma-3-4b-it`) | 0% | **75%** | **+75 pts** |
+| mid | **Qwen 2.5 7B** (`qwen/qwen-2.5-7b-instruct`) | 0% | **25%** | **+25 pts** |
+| mid | **Llama 3.1 8B** (`meta-llama/Llama-3.1-8B-Instruct`) | 50% | **100%** | **+50 pts** |
+| mid | **Gemma 3 12B** (`google/gemma-3-12b-it`) | 50% | **75%** | **+25 pts** |
+| frontier | **Gemini 3.5 Flash** (`google/gemini-3.5-flash`) | 75% | **100%** | **+25 pts** |
+| frontier | **Claude Opus 4.7** (`anthropic/claude-opus-4.7`) | 50% | **100%** | **+50 pts** |
+| frontier | **GPT-5.5** (`openai/gpt-5.5`) | 75% | **100%** | **+25 pts** |
+| **Headline (9 models · 4 tasks · 36 runs/side)** | | **33%** | **64%** | **+31 pts** |
+> Every model with baseline capability to emit valid PHP gains **+25 to +75
+> points** when the task is wrapped in the simplicio contract. The **two
+> sub-2B/4B-MoE models score 0% on both sides** — they can't produce a
+> parseable PHP file regardless of prompt — so the contract has nothing to
+> amplify. Honest scope: simplicio multiplies capable models, it does not
+> create capability in tiny ones. Three frontier models hit **100%** with the
+> contract.
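The headline row can be re-derived from the per-model rows. The sketch below assumes the headline is the unweighted mean of the nine per-model pass rates, which equals the pooled rate here because every model ran the same 4 tasks:

```python
# Per-model pass rates in percent, copied from the table: (without, with).
RESULTS = {
    "Llama 3.2 1B": (0, 0),
    "Gemma 3n e4B": (0, 0),
    "Gemma 3 4B": (0, 75),
    "Qwen 2.5 7B": (0, 25),
    "Llama 3.1 8B": (50, 100),
    "Gemma 3 12B": (50, 75),
    "Gemini 3.5 Flash": (75, 100),
    "Claude Opus 4.7": (50, 100),
    "GPT-5.5": (75, 100),
}

def headline(rows):
    """Return rounded (without, with, gain) averaged over all models."""
    without = round(sum(w for w, _ in rows.values()) / len(rows))
    with_contract = round(sum(s for _, s in rows.values()) / len(rows))
    return without, with_contract, with_contract - without

print(headline(RESULTS))  # (33, 64, 31)
```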
+Full report: [`bench/results_exec_sindico.md`](bench/results_exec_sindico.md) ·
+[`bench/results_exec_sindico.pdf`](bench/results_exec_sindico.pdf). Reproduce:
+clone `sistema-sindico` (public), `composer install`, then
+`BENCH_BASE_URL=… BENCH_API_KEY=… BENCH_MODELS=…
+python3 bench/run_exec_sindico.py`. Hidden tests live under
+`bench/sindico_hidden/`; harness in `bench/run_exec_sindico.py`.
+### 2. Contract-adherence benchmark — structural checks across many models
+The tables below measure something **narrower and complementary**: did the
+model produce **the right shape of actionable output** (target-file mention +
+DIFF block + TEST block + contract-state keywords) on a raw one-line prompt
+vs. the simplicio contract. Scoring is via **deterministic regex** on the
+output — it's not a proof that the code compiles or passes runtime tests.
+That's what the execution benchmark above is for. The two answer different
+questions: this one measures *contract adherence at scale across many models*;
+the execution one measures *runtime correctness on a real codebase*.
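To make "deterministic regex" scoring concrete, here is a small sketch of structural checks on a model's raw output. The three patterns and the `score_output` name are assumptions chosen for illustration; the project's real checks (156 per run) are not reproduced here:

```python
import re

def score_output(output, target):
    """Run hypothetical structural checks and return (checks, pass_rate)."""
    checks = {
        # The answer names the file it intends to change.
        "mentions_target": target in output,
        # The answer contains at least one unified-diff hunk header.
        "has_diff_block": bool(re.search(r"^@@ .* @@", output, re.MULTILINE)),
        # The answer contains something that looks like a test assertion.
        "has_test_block": bool(re.search(r"\b(expect|assert)\w*\s*\(", output)),
    }
    return checks, sum(checks.values()) / len(checks)

shaped = "src/a.ts\n@@ -1,2 +1,3 @@\n+hidden\nexpect(button).toBeNull()"
vague = "You could hide the button with a structural directive."
print(score_output(shaped, "src/a.ts")[1])  # 1.0
print(score_output(vague, "src/a.ts")[1])   # 0.0
```

Checks like these say nothing about whether the diff applies or the test passes, which is why the execution benchmark is reported separately.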
 Same model. Same task. Only the prompt changes. **Measured, reproducible, deterministic.**
-**Fourteen models tested across three runs** — five sub-4B tiny models, six
-frontier 2026 models, and three mid-tier 7B–12B open models. Every one gained
-at least **+14 points** when wrapped in simplicio's 6-layer contract.
+**Seventeen models tested across four runs** — three local Ollama models on an
+M1 MacBook (8 GB), five sub-4B tiny models, six frontier 2026 models, and three
+mid-tier 7B–12B open models. Every one gained at least **+14 points** when
+wrapped in simplicio's 6-layer contract.
+#### Hugging Face — Qwen2.5-Coder, re-run on 2026-05-27 (latest mapper, 10 cases/side, 156 checks)
+First batch of the smaller→larger re-benchmark against the latest
+`simplicio-mapper` artifacts. The 1.5B runs on CPU via `transformers`
+(Hugging Face Inference Providers does not serve it); the 3B and 7B run
+through the HF router (`https://router.huggingface.co/v1`).
+| Model | Without simplicio | With simplicio | Gain |
+|---|---|---|---|
+| **Qwen 2.5 Coder 7B** (`Qwen/Qwen2.5-Coder-7B-Instruct`) | 38% | **96%** | **+58 pts** |
+| **Qwen 2.5 Coder 3B** (`Qwen/Qwen2.5-Coder-3B-Instruct`) | 34% | **94%** | **+60 pts** |
+| **Qwen 2.5 Coder 1.5B** (`Qwen/Qwen2.5-Coder-1.5B-Instruct`, local CPU) | 30% | **92%** | **+62 pts** |
+| **HF avg (3 models · 10 cases · 156 checks)** | **34%** | **94%** | **+60 pts (+172%)** |
+> Monotonic from smaller to larger: pass-rate with simplicio climbs **92% →
+> 94% → 96%** as the model grows, while the raw-prompt baseline stays at
+> **30–38%**. The 1.5B model gains the most (**+62 pts**) — the contract does
+> the heaviest lifting where the model is weakest. Reproduce:
+> `BENCH_BASE_URL=https://router.huggingface.co/v1 BENCH_API_KEY=<hf-token>
+> BENCH_MODELS="local:Qwen/Qwen2.5-Coder-1.5B-Instruct,Qwen/Qwen2.5-Coder-3B-Instruct,Qwen/Qwen2.5-Coder-7B-Instruct"
+> python3 bench/run_offline.py`.
+Side-by-side delta vs the previously published numbers (same regex methodology,
+all 17 README models re-measured):
+[`bench/results_comparison.md`](bench/results_comparison.md) ·
+[`bench/results_comparison.pdf`](bench/results_comparison.pdf). Headline on the
+14 models with clean data: **with simplicio averaged 86% → 88% (+2 pts); without
+simplicio 36% → 36% (+1 pt)** — the new run reproduces the published numbers
+within noise. Three frontier models (Claude Opus 4.7, Qwen 3.7 Max, DeepSeek V4
+Pro) show `n/a` for the new column: their OpenRouter calls hit account-level
+HTTP 402 / provider failures on >50% of requests this round, so the sample is
+too small to publish; their old numbers still stand.
+#### Local offline — qwen2.5-coder on Ollama, M1 8 GB, run on 2026-05-27 (30 runs/side, 156 checks)
+| Model | Without simplicio | With simplicio | Gain |
+|---|---|---|---|
+| **Qwen 2.5 Coder 7B** (`qwen2.5-coder:7b`) | 36% | **92%** | **+56 pts** |
+| **Qwen 2.5 Coder 3B** (`qwen2.5-coder:3b`) | 34% | **82%** | **+48 pts** |
+| **Qwen 2.5 Coder 1.5B** (`qwen2.5-coder:1.5b`) | 32% | **88%** | **+56 pts** |
+| **Local avg (3 models · 10 cases · 156 checks)** | **34%** | **87%** | **+53 pts (+156%)** |
+> **Zero API key, zero network.** Bench ran fully offline against
+> `http://localhost:11434/v1` (Ollama's OpenAI-compatible endpoint). A
+> 1.5B-param model running on a 4-year-old laptop reaches **88%** pass-rate
+> with simplicio's contract — same hardware, same model, raw prompt = 32%.
+> Reproduce: `BENCH_BASE_URL=http://localhost:11434/v1 BENCH_API_KEY=ollama
+> BENCH_MODELS="qwen2.5-coder:7b" python3 bench/run_offline.py`.
 #### Tiny models — sub-4B, run on 2026-05-26 (50 runs/side, 260 checks)
@@ -98,11 +220,12 @@ at least **+14 points** when wrapped in simplicio's 6-layer contract.
 | **Qwen 2.5 7B** (`qwen/qwen-2.5-7b-instruct`) | 34% | **88%** | **+54 pts** |
 | **Mid-tier avg (3 models · 10 cases · 156 checks)** | **35%** | **90%** | **+55 pts (+156%)** |
-> **Across all 14 models tested across three runs**, the average gain is **+51
+> **Across all 17 models tested across four runs**, the average gain is **+51
 > points**. Smallest: **+14 pts** (Llama 3.2 1B — the contract still moves a
-> 1B-param model). Largest: **+62 pts** (GPT-5.5). The contract helps tiny
-> sub-4B models, frontier reasoning models, and mid-tier 7B–12B alike — five
-> of the six frontier models hit **100% pass-rate**.
+> 1B-param model). Largest: **+62 pts** (GPT-5.5). The contract helps local
+> Ollama models on a 4-year-old laptop, tiny sub-4B models, frontier reasoning
+> models, and mid-tier 7B–12B alike — five of the six frontier models hit
+> **100% pass-rate**.
 ### Output-quality signals (rate across all 60 frontier runs)
@@ -149,6 +272,26 @@ test          JUDGE   contract written as testable states
 verify        PROOF   ran it — did it actually pass? loop-fix up to 3x
 ```
+### Rich mapper integration
+When `simplicio-mapper` has generated `.simplicio/project-map.json` and
+`.simplicio/precedent-index.json`, `simplicio-cli` consumes them directly:
+- exact target file metadata, roles, imports and exports
+- entry points, test files, modules, entities and architecture signals
+- recent changes and changed-file context
+- precedent snippets ranked from `precedent-index.json`
+If those artifacts are missing, the CLI falls back to the older target-file
+inspection path, so existing projects keep working.
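The fallback behaviour can be pictured as a loader that returns nothing when the artifacts are absent. This is a sketch only: `load_mapper_artifacts` is a hypothetical name, it assumes both files must be present, and it makes no claim about the JSON schema inside them:

```python
import json
import tempfile
from pathlib import Path

ARTIFACTS = ("project-map.json", "precedent-index.json")

def load_mapper_artifacts(project_root):
    """Return the parsed artifacts, or None so the caller can fall back."""
    base = Path(project_root) / ".simplicio"
    paths = [base / name for name in ARTIFACTS]
    if not all(p.is_file() for p in paths):
        return None  # artifacts missing: use the older target-file inspection
    return {p.name: json.loads(p.read_text(encoding="utf-8")) for p in paths}

# Demo in a throwaway directory that has no .simplicio/ folder.
with tempfile.TemporaryDirectory() as root:
    print(load_mapper_artifacts(root))  # None
```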
+### Adaptive retry and observability
+The retry loop now validates generated output before applying/testing it,
+classifies failures, and sends targeted retry feedback. Bench and pipeline runs
+can append lightweight JSONL records to `.simplicio/runs.jsonl` with prompt
+variant, model/provider, estimated tokens, target, mode and failure class.
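A JSONL log of this kind is one JSON object per line, appended per run. The sketch below shows the general pattern; the field names and the added `ts` timestamp are assumptions, since the README lists the recorded concepts but not the exact keys:

```python
import json
import tempfile
import time
from pathlib import Path

def append_run_record(project_root, **fields):
    """Append one JSON object as a single line to .simplicio/runs.jsonl."""
    path = Path(project_root) / ".simplicio" / "runs.jsonl"
    path.parent.mkdir(parents=True, exist_ok=True)
    record = {"ts": time.time(), **fields}  # "ts" is an illustrative extra
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")
    return path

# Demo: two runs produce two lines.
with tempfile.TemporaryDirectory() as root:
    append_run_record(root, prompt_variant="contract", model="qwen2.5-coder:7b",
                      est_tokens=1200, mode="task", failure_class=None)
    log = append_run_record(root, prompt_variant="raw", model="qwen2.5-coder:7b",
                            est_tokens=300, mode="task", failure_class="no_diff")
    print(len(log.read_text(encoding="utf-8").splitlines()))  # 2
```

Appending line by line keeps each record independent, so a partially written run never corrupts earlier entries.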
 **The idea in one line: don't ask the model to guess — hand it the path.**
 Each layer terminates one decision the model would otherwise hallucinate.
 Relevant > complete — inject the *right* context, never *all* of it.
@@ -158,12 +301,109 @@ Relevant > complete — inject the *right* context, never *all* of it.
 ## Install
 ```bash
-pip install simplicio-cli           # from PyPI
+pip install simplicio-cli           # from PyPI (pulls simplicio-mapper + simplicio-prompt)
 # or
 pip install -e .                    # from this repo
 ```
-### Auto-activation in Claude Code (often zero-step)
+The install ships **three Simplicio packages** that play distinct roles:
+- **`simplicio-cli`** (this repo) — the 6-layer task contract + verify loop.
+  The default wrapper for one-shot code edits. *Headline: +31 pts vs raw
+  baseline on real PHPUnit (see Section 1).*
+- **`simplicio-mapper`** — emits `.simplicio/project-map.json` and
+  `precedent-index.json` so the CLI can target the right file/precedent
+  without guessing.
+- **`simplicio-prompt`** (≥1.7.0) — the Tuple-Space + Yool agent runtime
+  kernel (`kernel.subagent_runtime.SubagentRuntime`) for orchestrated work:
+  real parallel subagent fan-out on any OpenAI-compatible provider, with
+  bounded lane concurrency, a receipt cache, jittered backoff and a
+  circuit breaker. *On one-shot code tasks it's net-neutral and not the
+  right tool (use simplicio-cli for those); on orchestrated multi-step /
+  fan-out work it's the engine.* Our chosen fan-out default for this
+  project is **N=200** subagents — the level where harder tasks start to
+  recover from per-call noise (partial Qwen2.5-Coder-3B data:
+  `env_get_int` at N=64 → 0 PHPUnit passes of 64; at higher N some tasks
+  flip to passing). The fan-out benchmark
+  (`bench/run_fanout.py`) measures both real PHPUnit pass-rate and a
+  structural regex check on every subagent and surfaces the gap; full
+  ongoing numbers in [`bench/results_fanout.md`](bench/results_fanout.md) ·
+  [`bench/results_fanout.pdf`](bench/results_fanout.pdf).
+Each is independently published on PyPI; ship them as a set so the CLI's
+mapper-rich precedent ranking, contract-shaped prompts, and (when called
+for) real subagent fan-out all work out of the box without extra setup.
+---
+## How you use it — pick your path
+simplicio-cli has **three distinct entry points**. Same engine, three front doors — pick the one that matches what you already pay for:
+| You have | Path | LLM call goes through | Need API key? |
+|---|---|---|---|
+| **Claude Code** (Pro / Max / Team / API) | Skill + hook auto-installed in `~/.claude/` | Claude Code itself, using your logged-in session | **No** |
+| **Claude Code OAuth or Codex CLI / ChatGPT Plus** | `simplicio task` with `SIMPLICIO_MODEL=claude-cli/<m>` or `codex-cli/<m>` | Shell-out to `claude -p` / `codex exec` (subprocess uses your existing login) | **No** |
+| **API key** for any provider (Anthropic, OpenAI, OpenRouter, GLM, DeepSeek, Ollama…) | `simplicio task` standalone CLI | The provider SDK directly | **Yes** — set `SIMPLICIO_API_KEY` |
+**Most users land on Path 1.** `pip install simplicio-cli` puts the binary on PATH; the first invocation auto-installs the skill + hook in `~/.claude/` (idempotent, opt-out via `SIMPLICIO_SKIP_AUTO_INIT=1`). From that moment, every code-edit prompt you type **inside Claude Code** is silently routed through simplicio's 6-layer contract — no extra config, no key, no cost beyond your existing Claude subscription.
+**Path 2 — subscription shell-out (zero key).** If you have a Claude Pro/Max session (`claude login`) or a ChatGPT Plus + Codex CLI session (`codex login`) and want to drive simplicio from CI, scripts, or any context **outside** Claude Code, set `SIMPLICIO_MODEL=claude-cli/<model>` or `codex-cli/<model>`. simplicio spawns the CLI as a subprocess; the call rides your existing OAuth session — no API key required. A recursion guard (`SIMPLICIO_HOOK_GUARD=1`) is injected so the inner CLI does not re-fire simplicio's own hook.
+**Path 3 is for environments without any logged-in CLI** — a remote server, a build runner, a notebook, a different LLM provider. You bring an API key (Anthropic, OpenRouter, OpenAI, GLM, DeepSeek, Ollama…), simplicio calls the provider directly.
+### Path 1 example — inside Claude Code
+After `pip install simplicio-cli && simplicio smoke` (which triggers auto-bootstrap), just type your task in Claude Code:
+```
+hide the Delete button for non-admins in src/app/screen/screen.component.html
+```
+Claude Code sees the skill (semantic match) and the hook hint (`[SIMPLICIO_PROMPT_HINT]` on stderr — deterministic classifier). It runs simplicio's 6-layer contract under the hood. You see the diff + tests + verification — same as before, just dramatically more accurate.
+### Path 2 example — subscription shell-out, zero key
+You already pay for Claude Pro/Max or ChatGPT Plus + Codex CLI. simplicio
+piggybacks on that login — no extra bill, no key to manage.
+```bash
+# Option A — Claude Code subscription (run `claude login` once)
+export SIMPLICIO_MODEL=claude-cli/sonnet     # or claude-cli/opus, claude-cli/default
+unset  SIMPLICIO_API_KEY                     # explicitly: no key needed
+simplicio task "hide Delete button for non-admins" --stack angular \
+  --target src/app/screen/screen.component.html
+# Option B — Codex CLI subscription (run `codex login` once)
+export SIMPLICIO_MODEL=codex-cli/gpt-5       # or codex-cli/default
+simplicio task "..." --stack angular --target ...
+```
+How it works: simplicio shells out to `claude -p "<prompt>"` (or `codex exec "<prompt>"`) as a subprocess, captures stdout, runs the test loop. The inner CLI authenticates via your existing OAuth session in `~/.claude/` or `~/.codex/`. simplicio sets `SIMPLICIO_HOOK_GUARD=1` in the subprocess env so the inner Claude Code session does **not** re-fire simplicio's own UserPromptSubmit hook (no infinite recursion).
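The shell-out pattern described above can be sketched in a few lines of Python. `run_inner_cli` is a hypothetical helper, not simplicio's actual implementation; the only details taken from the text are the `claude -p` / `codex exec` commands and the `SIMPLICIO_HOOK_GUARD=1` variable:

```python
import os
import subprocess
import sys

def run_inner_cli(argv, prompt, timeout=600):
    """Run a CLI with the prompt as its last argument and return stdout."""
    # Copy the parent environment and add the recursion guard for the child.
    env = dict(os.environ, SIMPLICIO_HOOK_GUARD="1")
    result = subprocess.run(
        [*argv, prompt], env=env, capture_output=True, text=True, timeout=timeout
    )
    result.check_returncode()  # raise CalledProcessError on a non-zero exit
    return result.stdout

if __name__ == "__main__":
    # Stand-in child process that echoes the guard and the prompt it received.
    echo = [sys.executable, "-c",
            "import os, sys; print(os.environ['SIMPLICIO_HOOK_GUARD'], sys.argv[1])"]
    print(run_inner_cli(echo, "hello"))
```

With a logged-in CLI, the same helper would be called as `run_inner_cli(["claude", "-p"], prompt)` or `run_inner_cli(["codex", "exec"], prompt)`.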
+### Path 3 example — standalone with API key
+```bash
+export SIMPLICIO_API_KEY=sk-or-v1-…                      # OpenRouter key
+export SIMPLICIO_MODEL=anthropic/claude-opus-4
+export SIMPLICIO_BASE_URL=https://openrouter.ai/api/v1
+simplicio index --stack angular                           # one-time, builds embedding cache
+simplicio task "hide Delete button for non-admins" \
+  --stack angular \
+  --target src/app/screen/screen.component.html \
+  --criteria "- no admin perm: button absent from DOM
+- with admin perm: button present" \
+  --constraints "- don't touch save flow
+- build passes"
+```
+Provider-agnostic — see [Configure](#configure--any-llm-nothing-hardcoded) for the full matrix.
+---
+### Path 1 deep-dive — auto-activation in Claude Code
 `pip install` puts `simplicio` on your PATH. To make Claude Code
 **automatically** route code-edit tasks through simplicio, a skill + hook
@@ -239,8 +479,14 @@ user prompt. UserPromptSubmit is the right pre-hook for routing decisions.
 | Preview without writing | `simplicio init --dry-run` |
 | Skill-only (no hook) | Copy `.skills/simplicio-cli/SKILL.md` to `~/.claude/skills/simplicio-cli/SKILL.md` manually, skip `simplicio init` |
+---
 ## Configure — any LLM, nothing hardcoded
+> Applies to **Path 3** (standalone CLI with an API key); Path 2 only needs
+> `SIMPLICIO_MODEL`. Path 1 users can skip this entire section, because Claude
+> Code handles the LLM call with the model and key already tied to your
+> subscription.
 | Provider | SIMPLICIO_MODEL | SIMPLICIO_BASE_URL |
 |---|---|---|
 | OpenRouter | `anthropic/claude-opus-4` | `https://openrouter.ai/api/v1` |
@@ -258,25 +504,53 @@ your `base_url` — so **any** OpenAI-like provider works without code changes.
 simplicio smoke      # prints provider config + one test call
 ```
-## Use
+### The pipeline (all paths)
-```bash
-# index once (caches embeddings; re-run after big changes)
-simplicio index --stack angular
+Whichever entry point you use, each task runs through the same engine:
-# run a task
-simplicio task "hide Delete button for non-admins" \
-  --stack angular \
-  --target src/app/screen/screen.component.html \
-  --criteria "- no admin perm: button absent from DOM
-- with admin perm: button present" \
-  --constraints "- don't touch save flow
-- build passes"
+```
+precedent (from cache)
+  → skill match
+  → 6-layer prompt
+  → LLM generates diff + test + Playwright
+  → apply diff
+  → run SIMPLICIO_TEST_CMD
+  → pass?  done  :  send the error back → fix → retry (up to 3x)
 ```
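The generate, apply, test, retry cycle above can be sketched as a small loop. This is an illustration under stated assumptions: `generate` and `apply_and_test` are hypothetical callables, and "up to 3x" is read here as three attempts in total:

```python
def run_with_retries(generate, apply_and_test, max_attempts=3):
    """Return (candidate, attempts_used), or (None, max_attempts) on failure."""
    feedback = None  # the first attempt has no error to learn from
    for attempt in range(1, max_attempts + 1):
        candidate = generate(feedback)
        ok, error = apply_and_test(candidate)
        if ok:
            return candidate, attempt
        feedback = error  # send the failure back into the next generation
    return None, max_attempts

# Toy stand-ins: the "model" only succeeds once it has seen an error message.
def fake_generate(feedback):
    return "good patch" if feedback else "bad patch"

def fake_apply_and_test(candidate):
    return candidate == "good patch", "AssertionError: button still visible"

print(run_with_retries(fake_generate, fake_apply_and_test))  # ('good patch', 2)
```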
-Each `task`: precedent (from cache) → skill match → 6 layers → LLM generates
-(diff + test + Playwright) → apply → run `SIMPLICIO_TEST_CMD` → pass? **done** :
-send the error back → fix → retry (up to 3x).
+The 6-layer contract is what moves pass-rate from 41% to 99% on frontier
+models (see [the numbers](#why-it-works--the-numbers) above). The retry loop
+is what catches the remaining edge cases — measured separately in the
+[4-quadrant bench](#4-quadrant-bench--agent--simplicio-matrix).
+### Common questions
+**"I have a Claude Pro subscription but no API key — does this work?"** Yes,
+on Path 1. Install simplicio-cli, open Claude Code, type your task as normal.
+Claude Code makes the LLM call with your subscription; simplicio shapes the
+prompt. No key needed.
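+**"I want to run it in CI / a script / outside Claude Code."** Path 2 or
+Path 3. With a logged-in `claude` or `codex` CLI, set
+`SIMPLICIO_MODEL=claude-cli/<model>` or `codex-cli/<model>` (Path 2, no key).
+Otherwise get an API key from any of the providers above (OpenRouter is the
+cheapest way to try multiple models behind one key), set `SIMPLICIO_API_KEY` +
+`SIMPLICIO_MODEL` + optional `SIMPLICIO_BASE_URL`, and run `simplicio task ...`
+(Path 3).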
+**"I have Codex CLI / ChatGPT Plus and don't want to pay for an API key."**
+Use Path 2. Run `codex login` once, set `SIMPLICIO_MODEL=codex-cli/<model>`,
+and `simplicio task` shells out to `codex exec` using your subscription, with
+no API key.
+**"Will Claude Code use simplicio for *every* prompt now?"** No. The skill
+only triggers on prompts that look like code edits (the description is
+specific). The hook fires `simplicio detect` on every prompt but only emits
+a hint when the deterministic classifier scores ≥ 3 (verbs + file extensions
++ code nouns − read-only cues). "What does this function do?" gets no
+nudge. "Add a delete confirmation to UserList.tsx" does.
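The scoring rule above (verbs + file extensions + code nouns − read-only cues, hint at ≥ 3) can be sketched as follows. The word lists, weights, and function names are invented for illustration and are not the ones `simplicio detect` uses:

```python
import re

# Hypothetical vocabularies; the real classifier's lists are not shown here.
EDIT_VERBS = {"add", "hide", "fix", "remove", "rename", "refactor", "implement"}
CODE_NOUNS = {"button", "component", "function", "method", "class", "confirmation"}
READ_ONLY_CUES = {"what", "why", "explain", "how", "describe"}
FILE_EXT = re.compile(r"\b[\w./-]+\.(?:tsx?|jsx?|py|php|html|css|go|rs|java)\b", re.I)

def edit_score(prompt):
    """Add one point per edit signal, subtract one per read-only cue."""
    words = set(re.findall(r"[a-z]+", prompt.lower()))
    score = len(words & EDIT_VERBS)
    score += len(FILE_EXT.findall(prompt))
    score += len(words & CODE_NOUNS)
    score -= len(words & READ_ONLY_CUES)
    return score

def should_hint(prompt, threshold=3):
    return edit_score(prompt) >= threshold

print(should_hint("What does this function do?"))                # False
print(should_hint("Add a delete confirmation to UserList.tsx"))  # True
```

A rule-based score like this is deterministic, so the same prompt always gets the same routing decision.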
+**"How do I turn it off?"** See [Disable / re-enable](#disable--re-enable)
+above. Two ways: env var (`SIMPLICIO_SKIP_AUTO_INIT=1` before first call) or
+delete the hook entry from `~/.claude/settings.json`.
 ---
