simplicio-cli 0.2.9__tar.gz → 0.4.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (39)
  1. simplicio_cli-0.4.0/PKG-INFO +705 -0
  2. simplicio_cli-0.4.0/README.md +663 -0
  3. {simplicio_cli-0.2.9 → simplicio_cli-0.4.0}/pyproject.toml +11 -2
  4. simplicio_cli-0.4.0/simplicio/adaptive.py +60 -0
  5. {simplicio_cli-0.2.9 → simplicio_cli-0.4.0}/simplicio/bench.py +32 -7
  6. simplicio_cli-0.4.0/simplicio/cli.py +120 -0
  7. simplicio_cli-0.4.0/simplicio/detect.py +139 -0
  8. simplicio_cli-0.4.0/simplicio/init.py +169 -0
  9. simplicio_cli-0.4.0/simplicio/mapper.py +207 -0
  10. simplicio_cli-0.4.0/simplicio/observability.py +32 -0
  11. simplicio_cli-0.4.0/simplicio/pipeline.py +100 -0
  12. {simplicio_cli-0.2.9 → simplicio_cli-0.4.0}/simplicio/precedent.py +19 -0
  13. simplicio_cli-0.4.0/simplicio/prompt.py +56 -0
  14. simplicio_cli-0.4.0/simplicio/providers.py +142 -0
  15. simplicio_cli-0.4.0/simplicio/templates/SKILL.md +169 -0
  16. {simplicio_cli-0.2.9 → simplicio_cli-0.4.0}/simplicio/templates/simplicio_prompt.md +2 -0
  17. simplicio_cli-0.4.0/simplicio/templates/userpromptsubmit-hook.sh +30 -0
  18. simplicio_cli-0.4.0/simplicio/utils/__init__.py +8 -0
  19. simplicio_cli-0.4.0/simplicio/utils/cache.py +84 -0
  20. simplicio_cli-0.4.0/simplicio/utils/http_client.py +72 -0
  21. simplicio_cli-0.4.0/simplicio/utils/serialization.py +45 -0
  22. simplicio_cli-0.4.0/simplicio_cli.egg-info/PKG-INFO +705 -0
  23. {simplicio_cli-0.2.9 → simplicio_cli-0.4.0}/simplicio_cli.egg-info/SOURCES.txt +11 -0
  24. {simplicio_cli-0.2.9 → simplicio_cli-0.4.0}/simplicio_cli.egg-info/requires.txt +5 -0
  25. simplicio_cli-0.2.9/PKG-INFO +0 -355
  26. simplicio_cli-0.2.9/README.md +0 -318
  27. simplicio_cli-0.2.9/simplicio/cli.py +0 -43
  28. simplicio_cli-0.2.9/simplicio/pipeline.py +0 -28
  29. simplicio_cli-0.2.9/simplicio/prompt.py +0 -25
  30. simplicio_cli-0.2.9/simplicio/providers.py +0 -63
  31. simplicio_cli-0.2.9/simplicio_cli.egg-info/PKG-INFO +0 -355
  32. {simplicio_cli-0.2.9 → simplicio_cli-0.4.0}/LICENSE +0 -0
  33. {simplicio_cli-0.2.9 → simplicio_cli-0.4.0}/setup.cfg +0 -0
  34. {simplicio_cli-0.2.9 → simplicio_cli-0.4.0}/simplicio/__init__.py +0 -0
  35. {simplicio_cli-0.2.9 → simplicio_cli-0.4.0}/simplicio/cache.py +0 -0
  36. {simplicio_cli-0.2.9 → simplicio_cli-0.4.0}/simplicio/skill_router.py +0 -0
  37. {simplicio_cli-0.2.9 → simplicio_cli-0.4.0}/simplicio_cli.egg-info/dependency_links.txt +0 -0
  38. {simplicio_cli-0.2.9 → simplicio_cli-0.4.0}/simplicio_cli.egg-info/entry_points.txt +0 -0
  39. {simplicio_cli-0.2.9 → simplicio_cli-0.4.0}/simplicio_cli.egg-info/top_level.txt +0 -0
@@ -0,0 +1,705 @@
1
+ Metadata-Version: 2.4
2
+ Name: simplicio-cli
3
+ Version: 0.4.0
4
+ Summary: Portable task-to-code pipeline that works with any LLM. Turn a one-line task into a verified code change — diff + test + verify loop. +55 pts on a 156-check benchmark, 21% faster, ~same tokens.
5
+ Author-email: Wesley Simplicio <wesleybob4@gmail.com>
6
+ License: MIT
7
+ Project-URL: Homepage, https://github.com/wesleysimplicio/simplicio-cli
8
+ Project-URL: Repository, https://github.com/wesleysimplicio/simplicio-cli
9
+ Project-URL: Issues, https://github.com/wesleysimplicio/simplicio-cli/issues
10
+ Project-URL: Changelog, https://github.com/wesleysimplicio/simplicio-cli/releases
11
+ Keywords: llm,ai,agent,code-generation,prompt-engineering,openrouter,openai,anthropic,claude,developer-tools,cli,rag,embeddings,verify-loop,task-automation
12
+ Classifier: Development Status :: 4 - Beta
13
+ Classifier: Environment :: Console
14
+ Classifier: Intended Audience :: Developers
15
+ Classifier: License :: OSI Approved :: MIT License
16
+ Classifier: Operating System :: OS Independent
17
+ Classifier: Programming Language :: Python :: 3
18
+ Classifier: Programming Language :: Python :: 3 :: Only
19
+ Classifier: Programming Language :: Python :: 3.9
20
+ Classifier: Programming Language :: Python :: 3.10
21
+ Classifier: Programming Language :: Python :: 3.11
22
+ Classifier: Programming Language :: Python :: 3.12
23
+ Classifier: Topic :: Software Development
24
+ Classifier: Topic :: Software Development :: Code Generators
25
+ Classifier: Topic :: Software Development :: Libraries :: Python Modules
26
+ Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
27
+ Requires-Python: >=3.9
28
+ Description-Content-Type: text/markdown
29
+ License-File: LICENSE
30
+ Requires-Dist: sentence-transformers>=2.2
31
+ Requires-Dist: numpy>=1.23
32
+ Requires-Dist: anthropic>=0.30
33
+ Requires-Dist: openai>=1.30
34
+ Requires-Dist: simplicio-mapper>=0.5.0
35
+ Requires-Dist: simplicio-prompt>=1.7.0
36
+ Requires-Dist: httpx>=0.27
37
+ Requires-Dist: orjson>=3.10
38
+ Requires-Dist: diskcache>=5.6
39
+ Provides-Extra: bench
40
+ Requires-Dist: fpdf2>=2.7; extra == "bench"
41
+ Dynamic: license-file
42
+
43
+ # simplicio-cli
44
+
45
+ **Your tasks with 99% accuracy using any LLM (Claude, DeepSeek, Codex, Gemini, Hermes, OpenClaw, Cursor).**
46
+
47
+ [![PyPI](https://img.shields.io/pypi/v/simplicio-cli.svg)](https://pypi.org/project/simplicio-cli/)
48
+ [![Python](https://img.shields.io/pypi/pyversions/simplicio-cli.svg)](https://pypi.org/project/simplicio-cli/)
49
+ [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)
50
+
51
+ [![simplicio-cli pipeline hero: one-line task to verified code change](https://raw.githubusercontent.com/wesleysimplicio/simplicio-cli/master/output/imagegen/simplicio-cli-readme-hero-web.png)](output/imagegen/simplicio-cli-readme-hero.png)
52
+
53
+ > *"hide the Delete button for non-admins"* → diff + test + applied + verified.
54
+ > **Zero API key inside Claude Code** (auto-installs, uses your subscription) — or
55
+ > bring your own key for any provider: OpenRouter, OpenAI, Anthropic, GLM,
56
+ > DeepSeek, Ollama.
57
+
58
+ ```bash
59
+ pip install simplicio-cli
60
+ ```
61
+
62
+ ---
63
+
64
+ ## Why it works — the numbers
65
+
66
+ Two complementary benchmarks measure different things. Read them in order.
67
+
68
+ ### 1. Execution benchmark — real project, real tasks, real test suite (the "does it work" answer)
69
+
70
+ **This is not regex pattern-matching. This is not a synthetic toy harness in
71
+ isolation.** Run against [`wesleysimplicio/sistema-sindico`](https://github.com/wesleysimplicio/sistema-sindico)
72
+ — a real condominium-management system in pure PHP 8, public on GitHub, with a
73
+ real PHPUnit suite (`vendor/bin/phpunit --configuration phpunit.xml.dist`).
74
+
75
+ For each task the model is asked for a **real engineering change** — add a new
76
+ method to an existing production class (permission helper, env parser,
77
+ rate-limit key builder, repository SQL builder, route introspection, etc.).
78
+ The generated file replaces the original in a working copy of the real repo;
79
+ a **hidden PHPUnit test** (never shown to the model, asserting BOTH true and
80
+ false states of the required behaviour) is dropped into
81
+ `tests/unit/Core/Hidden/`; the **entire production suite runs** (every
82
+ pre-existing test of the real codebase plus the hidden one). **Pass =
83
+ `phpunit` exit code 0** — the same green/red signal the project's CI would use
84
+ to merge a PR. The model's change must be *correct* (the new test passes) AND
85
+ must *not break existing behaviour* (every prior test still passes).
86
+
87
+ Both sides emit the complete file (identical output shape); the only variable
88
+ is the wrapping prompt.
89
+
90
+ 4 tasks × **9 models** (3 small · 3 mid · 3 frontier) = **36 runs per side** (72 runs across both sides), scored by `vendor/bin/phpunit` exit code on 2026-05-28:
91
+
92
+ | Tier | Model | Without simplicio | With simplicio | Gain |
93
+ |---|---|---|---|---|
94
+ | small | **Llama 3.2 1B** (`meta-llama/Llama-3.2-1B-Instruct`) | 0% | 0% | 0 pts |
95
+ | small | **Gemma 3n e4B** (`google/gemma-3n-E4B-it`) | 0% | 0% | 0 pts |
96
+ | small | **Gemma 3 4B** (`google/gemma-3-4b-it`) | 0% | **75%** | **+75 pts** |
97
+ | mid | **Qwen 2.5 7B** (`qwen/qwen-2.5-7b-instruct`) | 0% | **25%** | **+25 pts** |
98
+ | mid | **Llama 3.1 8B** (`meta-llama/Llama-3.1-8B-Instruct`) | 50% | **100%** | **+50 pts** |
99
+ | mid | **Gemma 3 12B** (`google/gemma-3-12b-it`) | 50% | **75%** | **+25 pts** |
100
+ | frontier | **Gemini 3.5 Flash** (`google/gemini-3.5-flash`) | 75% | **100%** | **+25 pts** |
101
+ | frontier | **Claude Opus 4.7** (`anthropic/claude-opus-4.7`) | 50% | **100%** | **+50 pts** |
102
+ | frontier | **GPT-5.5** (`openai/gpt-5.5`) | 75% | **100%** | **+25 pts** |
103
+ | **Headline (9 models · 4 tasks · 36 runs/side)** | | **33%** | **64%** | **+31 pts** |
104
+
105
+ > Every model with baseline capability to emit valid PHP gains **+25 to +75
106
+ > points** when the task is wrapped in the simplicio contract. The **two
107
+ > sub-2B/4B-MoE models score 0% on both sides** — they can't produce a
108
+ > parseable PHP file regardless of prompt — so the contract has nothing to
109
+ > amplify. Honest scope: simplicio multiplies capable models, it does not
110
+ > create capability in tiny ones. Three frontier models hit **100%** with the
111
+ > contract.
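The headline row can be re-derived from the per-model rows. The sketch below is a minimal check, assuming every model ran the same 4 tasks so that the plain mean of the per-model percentages equals the overall pass rate. `RESULTS` and `headline` are illustrative names, not part of the benchmark harness.

```python
# Per-model pass rates (%) copied from the execution-benchmark table,
# stored as (without simplicio, with simplicio).
RESULTS = {
    "Llama 3.2 1B": (0, 0),
    "Gemma 3n e4B": (0, 0),
    "Gemma 3 4B": (0, 75),
    "Qwen 2.5 7B": (0, 25),
    "Llama 3.1 8B": (50, 100),
    "Gemma 3 12B": (50, 75),
    "Gemini 3.5 Flash": (75, 100),
    "Claude Opus 4.7": (50, 100),
    "GPT-5.5": (75, 100),
}


def headline(results):
    """Return rounded (without, with, gain) averages over all models."""
    n = len(results)
    without = sum(w for w, _ in results.values()) / n
    with_contract = sum(c for _, c in results.values()) / n
    return round(without), round(with_contract), round(with_contract - without)


print(headline(RESULTS))  # (33, 64, 31)
```

The result matches the 33% / 64% / +31 pts headline row.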
112
+
113
+ Full report: [`bench/results_exec_sindico.md`](bench/results_exec_sindico.md) ·
114
+ [`bench/results_exec_sindico.pdf`](bench/results_exec_sindico.pdf). Reproduce:
115
+ clone `sistema-sindico` (public), `composer install`, then
116
+ `BENCH_BASE_URL=… BENCH_API_KEY=… BENCH_MODELS=…
117
+ python3 bench/run_exec_sindico.py`. Hidden tests live under
118
+ `bench/sindico_hidden/`; harness in `bench/run_exec_sindico.py`.
119
+
120
+ ### 2. Contract-adherence benchmark — structural checks across many models
121
+
122
+ The tables below measure something **narrower and complementary**: did the
123
+ model produce **the right shape of actionable output** (target-file mention +
124
+ DIFF block + TEST block + contract-state keywords) on a raw one-line prompt
125
+ vs. the simplicio contract. Scoring is via **deterministic regex** on the
126
+ output — it's not a proof that the code compiles or passes runtime tests.
127
+ That's what the execution benchmark above is for. The two answer different
128
+ questions: this one measures *contract adherence at scale across many models*;
129
+ the execution one measures *runtime correctness on a real codebase*.
130
+
131
+ Same model. Same task. Only the prompt changes. **Measured, reproducible, deterministic.**
132
+ **Seventeen models tested across four runs** — three local Ollama models on an
133
+ M1 MacBook (8 GB), five sub-4B tiny models, six frontier 2026 models, and three
134
+ mid-tier 7B–12B open models. Every one gained at least **+14 points** when
135
+ wrapped in simplicio's 6-layer contract.
136
+
137
+ #### Hugging Face — Qwen2.5-Coder, re-run on 2026-05-27 (latest mapper, 10 cases/side, 156 checks)
138
+
139
+ First batch of the smaller→larger re-benchmark against the latest
140
+ `simplicio-mapper` artifacts. The 1.5B runs on CPU via `transformers`
141
+ (Hugging Face Inference Providers does not serve it); the 3B and 7B run
142
+ through the HF router (`https://router.huggingface.co/v1`).
143
+
144
+ | Model | Without simplicio | With simplicio | Gain |
145
+ |---|---|---|---|
146
+ | **Qwen 2.5 Coder 7B** (`Qwen/Qwen2.5-Coder-7B-Instruct`) | 38% | **96%** | **+58 pts** |
147
+ | **Qwen 2.5 Coder 3B** (`Qwen/Qwen2.5-Coder-3B-Instruct`) | 34% | **94%** | **+60 pts** |
148
+ | **Qwen 2.5 Coder 1.5B** (`Qwen/Qwen2.5-Coder-1.5B-Instruct`, local CPU) | 30% | **92%** | **+62 pts** |
149
+ | **HF avg (3 models · 10 cases · 156 checks)** | **34%** | **94%** | **+60 pts (+172%)** |
150
+
151
+ > Monotonic from smaller to larger: pass-rate with simplicio climbs **92% →
152
+ > 94% → 96%** as the model grows, while the raw-prompt baseline stays at
153
+ > **30–38%**. The 1.5B model gains the most (**+62 pts**) — the contract does
154
+ > the heaviest lifting where the model is weakest. Reproduce:
155
+ > `BENCH_BASE_URL=https://router.huggingface.co/v1 BENCH_API_KEY=<hf-token>
156
+ > BENCH_MODELS="local:Qwen/Qwen2.5-Coder-1.5B-Instruct,Qwen/Qwen2.5-Coder-3B-Instruct,Qwen/Qwen2.5-Coder-7B-Instruct"
157
+ > python3 bench/run_offline.py`.
158
+
159
+ Side-by-side delta vs the previously published numbers (same regex methodology,
160
+ all 17 README models re-measured):
161
+ [`bench/results_comparison.md`](bench/results_comparison.md) ·
162
+ [`bench/results_comparison.pdf`](bench/results_comparison.pdf). Headline on the
163
+ 14 models with clean data: **with simplicio averaged 86% → 88% (+2 pts); without
164
+ simplicio 36% → 36% (+1 pt)** — the new run reproduces the published numbers
165
+ within noise. Three frontier models (Claude Opus 4.7, Qwen 3.7 Max, DeepSeek V4
166
+ Pro) show `n/a` for the new column: their OpenRouter calls hit account-level
167
+ HTTP 402 / provider failures on >50% of requests this round, so the sample is
168
+ too small to publish; their old numbers still stand.
169
+
170
+ #### Local offline — qwen2.5-coder on Ollama, M1 8 GB, run on 2026-05-27 (30 runs/side, 156 checks)
171
+
172
+ | Model | Without simplicio | With simplicio | Gain |
173
+ |---|---|---|---|
174
+ | **Qwen 2.5 Coder 7B** (`qwen2.5-coder:7b`) | 36% | **92%** | **+56 pts** |
175
+ | **Qwen 2.5 Coder 3B** (`qwen2.5-coder:3b`) | 34% | **82%** | **+48 pts** |
176
+ | **Qwen 2.5 Coder 1.5B** (`qwen2.5-coder:1.5b`) | 32% | **88%** | **+56 pts** |
177
+ | **Local avg (3 models · 10 cases · 156 checks)** | **34%** | **87%** | **+53 pts (+156%)** |
178
+
179
+ > **Zero API key, zero network.** Bench ran fully offline against
180
+ > `http://localhost:11434/v1` (Ollama's OpenAI-compatible endpoint). A
181
+ > 1.5B-param model running on a 4-year-old laptop reaches **88%** pass-rate
182
+ > with simplicio's contract — same hardware, same model, raw prompt = 32%.
183
+ > Reproduce: `BENCH_BASE_URL=http://localhost:11434/v1 BENCH_API_KEY=ollama
184
+ > BENCH_MODELS="qwen2.5-coder:7b" python3 bench/run_offline.py`.
185
+
186
+ #### Tiny models — sub-4B, run on 2026-05-26 (50 runs/side, 260 checks)
187
+
188
+ | Model | Without simplicio | With simplicio | Gain |
189
+ |---|---|---|---|
190
+ | **Gemma 3 4B** (`google/gemma-3-4b-it`) | 38% | **96%** | **+58 pts** |
191
+ | **Llama 3.2 3B** (`meta-llama/llama-3.2-3b-instruct`) | 28% | **73%** | **+45 pts** |
192
+ | **Gemma 3n e4B** (`google/gemma-3n-e4b-it`) | 44% | **88%** | **+44 pts** |
193
+ | **Phi-4 mini** (`microsoft/phi-4-mini-instruct`) | 36% | **73%** | **+37 pts** |
194
+ | **Llama 3.2 1B** (`meta-llama/llama-3.2-1b-instruct`) | 26% | **40%** | **+14 pts** |
195
+ | **Tiny avg (5 models · 10 cases · 260 checks)** | **35%** | **74%** | **+39 pts (+112%)** |
196
+
197
+ > **Not hosted on OpenRouter** (requested but skipped): Gemma 3 270M, Gemma 3 1B,
198
+ > Gemma 2 2B, Qwen3 0.6B, Qwen3 1.7B, Qwen2.5 0.5B, Qwen2.5 1.5B, Qwen 3B,
199
+ > Nemotron Nano 4B (OR's smallest Nemotron is 9B). Sub-4B substitutes used above.
200
+ > simplicio still gains **+14 to +58 points** across these five models, including **+14 pts** on the 1B-param Llama 3.2.
201
+
202
+ #### Frontier 2026 models — run on 2026-05-26 (60 runs/side, 312 checks)
203
+
204
+ | Model | Without simplicio | With simplicio | Gain |
205
+ |---|---|---|---|
206
+ | **GPT-5.5** (`openai/gpt-5.5`) | 38% | **100%** | **+62 pts** |
207
+ | **Kimi K2.6** (`moonshotai/kimi-k2.6`) | 40% | **100%** | **+60 pts** |
208
+ | **Gemini 3.5 Flash** (`google/gemini-3.5-flash`) | 42% | **100%** | **+58 pts** |
209
+ | **Qwen 3.7 Max** (`qwen/qwen3.7-max`) | 44% | **100%** | **+56 pts** |
210
+ | **Claude Opus 4.7** (`anthropic/claude-opus-4.7`) | 42% | **98%** | **+56 pts** |
211
+ | **DeepSeek V4 Pro** (`deepseek/deepseek-v4-pro`) | 44% | **96%** | **+52 pts** |
212
+ | **Frontier avg (6 models · 10 cases · 312 checks)** | **41%** | **99%** | **+58 pts (+136%)** |
213
+
214
+ #### Mid-tier 7B–12B open models — earlier run (v0.2.2, 30 runs/side, 156 checks)
215
+
216
+ | Model | Without simplicio | With simplicio | Gain |
217
+ |---|---|---|---|
218
+ | **Gemma 3 12B** (`google/gemma-3-12b-it`) | 34% | **92%** | **+58 pts** |
219
+ | **Llama 3.1 8B** (`meta-llama/llama-3.1-8b-instruct`) | 36% | **90%** | **+54 pts** |
220
+ | **Qwen 2.5 7B** (`qwen/qwen-2.5-7b-instruct`) | 34% | **88%** | **+54 pts** |
221
+ | **Mid-tier avg (3 models · 10 cases · 156 checks)** | **35%** | **90%** | **+55 pts (+156%)** |
222
+
223
+ > **Across all 17 models tested across four runs**, the average gain is **+51
224
+ > points**. Smallest: **+14 pts** (Llama 3.2 1B — the contract still moves a
225
+ > 1B-param model). Largest: **+62 pts** (GPT-5.5). The contract helps local
226
+ > Ollama models on a 4-year-old laptop, tiny sub-4B models, frontier reasoning
227
+ > models, and mid-tier 7B–12B alike; four of the six frontier models hit
228
+ > **100% pass-rate**.
229
+
230
+ ### Output-quality signals (rate across all 60 frontier runs)
231
+
232
+ | Signal | Raw prompt | With simplicio |
233
+ |---|---|---|
234
+ | **DIFF block present** | 36% | **98%** |
235
+ | Target file mentioned | 1% | **100%** |
236
+ | TEST block present | 88% | **98%** |
237
+
238
+ ### Cost — tokens & wall-clock (measured, not estimated)
239
+
240
+ Same provider, same models, same cases. Token counts pulled from the API
241
+ `usage` field; latency from `time.perf_counter()` around each call.
242
+
243
+ | Side | Tokens / run | Wall-clock / run | Total tokens (60 runs) | Total time |
244
+ |---|---|---|---|---|
245
+ | Raw prompt | 1,967 | 46.1s | 118,040 | 46m 07s |
246
+ | With simplicio | **3,168** | **57.6s** | **190,119** | **57m 33s** |
247
+ | Δ | **+61%** | **+25%** | +72,079 | +11m 26s |
248
+
249
+ simplicio wraps the objective in a 6-layer contract — more input tokens up
250
+ front, longer completions because the model produces the full DIFF + TEST +
251
+ EVIDENCE the contract demands instead of a one-line guess. The bill goes up,
252
+ but so does the **pass-rate (41% → 99%)** and the **DIFF-block rate (36% → 98%)** —
253
+ useful tokens, not chat.
254
+
255
+ > Six frontier models — GPT-5.5, Kimi K2.6, Gemini 3.5 Flash, Qwen 3.7 Max,
256
+ > Claude Opus 4.7, DeepSeek V4 Pro — gained **+52 to +62 points** when wrapped
257
+ > in simplicio's 6-layer contract. Without changing the model. Without
258
+ > fine-tuning. Four of six landed at **100% pass-rate with simplicio**.
259
+
260
+ Full report: [`bench/results.md`](bench/results.md) · [`bench/results.pdf`](bench/results.pdf) · raw outputs under `.simplicio/bench_runs/`.
261
+
262
+ ---
263
+
264
+ ## How it works
265
+
266
+ ```
267
+ mapper WHERE project structure + latest state
268
+ precedent HOW-1 the real snippet in THIS repo that already does it
269
+ skill-router HOW-2 the ONE mapper skill that matches (ranked, not all)
270
+ simplicio BUILD stacks the 6 layers into one prompt (cache-friendly)
271
+ test JUDGE contract written as testable states
272
+ verify PROOF ran it — did it actually pass? loop-fix up to 3x
273
+ ```
274
+
275
+ ### Rich mapper integration
276
+
277
+ When `simplicio-mapper` has generated `.simplicio/project-map.json` and
278
+ `.simplicio/precedent-index.json`, `simplicio-cli` consumes them directly:
279
+
280
+ - exact target file metadata, roles, imports and exports
281
+ - entry points, test files, modules, entities and architecture signals
282
+ - recent changes and changed-file context
283
+ - precedent snippets ranked from `precedent-index.json`
284
+
285
+ If those artifacts are missing, the CLI falls back to the older target-file
286
+ inspection path, so existing projects keep working.
287
+
288
+ ### Adaptive retry and observability
289
+
290
+ The retry loop now validates generated output before applying/testing it,
291
+ classifies failures, and sends targeted retry feedback. Bench and pipeline runs
292
+ can append lightweight JSONL records to `.simplicio/runs.jsonl` with prompt
293
+ variant, model/provider, estimated tokens, target, mode and failure class.
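A run log in this style can be sketched as below. This is a minimal illustration of append-only JSONL logging, not the code in `simplicio/observability.py`. The function names and the exact field keys (`prompt_variant`, `failure_class`, and so on) are assumptions derived from the field list above.

```python
import json
import tempfile
from pathlib import Path


def append_run_record(path, record):
    """Append one run record as a single JSON line, creating parent dirs."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, sort_keys=True) + "\n")


def read_run_records(path):
    """Load every non-empty line of a JSONL file back into dicts."""
    with Path(path).open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


# Demo in a throwaway directory; a real run would target .simplicio/runs.jsonl.
with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / ".simplicio" / "runs.jsonl"
    append_run_record(log, {
        "prompt_variant": "simplicio",  # hypothetical key names throughout
        "model": "qwen2.5-coder:7b",
        "provider": "ollama",
        "estimated_tokens": 3168,
        "target": "src/app/screen/screen.component.html",
        "mode": "task",
        "failure_class": None,
    })
    append_run_record(log, {"prompt_variant": "raw", "failure_class": "no_diff"})
    records = read_run_records(log)

print(len(records), records[1]["failure_class"])
```

One JSON object per line keeps appends cheap and lets a later analysis step stream the file without parsing it as a whole.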
294
+
295
+ **The idea in one line: don't ask the model to guess — hand it the path.**
296
+ Each layer terminates one decision the model would otherwise hallucinate.
297
+ Relevant > complete — inject the *right* context, never *all* of it.
298
+
299
+ ---
300
+
301
+ ## Install
302
+
303
+ ```bash
304
+ pip install simplicio-cli # from PyPI (pulls simplicio-mapper + simplicio-prompt)
305
+ # or
306
+ pip install -e . # from this repo
307
+ ```
308
+
309
+ The install ships **three Simplicio packages** that play distinct roles:
310
+
311
+ - **`simplicio-cli`** (this repo) — the 6-layer task contract + verify loop.
312
+ The default wrapper for one-shot code edits. *Headline: +31 pts vs raw
313
+ baseline on real PHPUnit (see Section 1).*
314
+ - **`simplicio-mapper`** — emits `.simplicio/project-map.json` and
315
+ `precedent-index.json` so the CLI can target the right file/precedent
316
+ without guessing.
317
+ - **`simplicio-prompt`** (≥1.7.0) — the Tuple-Space + Yool agent runtime
318
+ kernel (`kernel.subagent_runtime.SubagentRuntime`) for orchestrated work:
319
+ real parallel subagent fan-out on any OpenAI-compatible provider, with
320
+ bounded lane concurrency, a receipt cache, jittered backoff and a
321
+ circuit breaker. *On one-shot code tasks it's net-neutral and not the
322
+ right tool (use simplicio-cli for those); on orchestrated multi-step /
323
+ fan-out work it's the engine.* Our chosen fan-out default for this
324
+ project is **N=200** subagents — the level where harder tasks start to
325
+ recover from per-call noise (partial Qwen2.5-Coder-3B data:
326
+ `env_get_int` at N=64 → 0 PHPUnit passes of 64; at higher N some tasks
327
+ flip to passing). The fan-out benchmark
328
+ (`bench/run_fanout.py`) measures both real PHPUnit pass-rate and a
329
+ structural regex check on every subagent and surfaces the gap; full
330
+ ongoing numbers in [`bench/results_fanout.md`](bench/results_fanout.md) ·
331
+ [`bench/results_fanout.pdf`](bench/results_fanout.pdf).
332
+
333
+ Each is independently published on PyPI; they install together as a set so the CLI's
334
+ mapper-rich precedent ranking, contract-shaped prompts, and (when called
335
+ for) real subagent fan-out all work out of the box without extra setup.
336
+
337
+ ---
338
+
339
+ ## How you use it — pick your path
340
+
341
+ simplicio-cli has **three distinct entry points**. Same engine, three front doors — pick the one that matches what you already pay for:
342
+
343
+ | You have | Path | LLM call goes through | Need API key? |
344
+ |---|---|---|---|
345
+ | **Claude Code** (Pro / Max / Team / API) | Skill + hook auto-installed in `~/.claude/` | Claude Code itself, using your logged-in session | **No** |
346
+ | **Claude Code OAuth or Codex CLI / ChatGPT Plus** | `simplicio task` with `SIMPLICIO_MODEL=claude-cli/<m>` or `codex-cli/<m>` | Shell-out to `claude -p` / `codex exec` (subprocess uses your existing login) | **No** |
347
+ | **API key** for any provider (Anthropic, OpenAI, OpenRouter, GLM, DeepSeek, Ollama…) | `simplicio task` standalone CLI | The provider SDK directly | **Yes** — set `SIMPLICIO_API_KEY` |
348
+
349
+ **Most users land on Path 1.** `pip install simplicio-cli` puts the binary on PATH; the first invocation auto-installs the skill + hook in `~/.claude/` (idempotent, opt-out via `SIMPLICIO_SKIP_AUTO_INIT=1`). From that moment, every code-edit prompt you type **inside Claude Code** is silently routed through simplicio's 6-layer contract — no extra config, no key, no cost beyond your existing Claude subscription.
350
+
351
+ **Path 2 — subscription shell-out (zero key).** If you have a Claude Pro/Max session (`claude login`) or a ChatGPT Plus + Codex CLI session (`codex login`) and want to drive simplicio from CI, scripts, or any context **outside** Claude Code, set `SIMPLICIO_MODEL=claude-cli/<model>` or `codex-cli/<model>`. simplicio spawns the CLI as a subprocess; the call rides your existing OAuth session — no API key required. A recursion guard (`SIMPLICIO_HOOK_GUARD=1`) is injected so the inner CLI does not re-fire simplicio's own hook.
352
+
353
+ **Path 3 is for environments without any logged-in CLI** — a remote server, a build runner, a notebook, a different LLM provider. You bring an API key (Anthropic, OpenRouter, OpenAI, GLM, DeepSeek, Ollama…), simplicio calls the provider directly.
354
+
355
+ ### Path 1 example — inside Claude Code
356
+
357
+ After `pip install simplicio-cli && simplicio smoke` (which triggers auto-bootstrap), just type your task in Claude Code:
358
+
359
+ ```
360
+ hide the Delete button for non-admins in src/app/screen/screen.component.html
361
+ ```
362
+
363
+ Claude Code sees the skill (semantic match) and the hook hint (`[SIMPLICIO_PROMPT_HINT]` on stderr — deterministic classifier). It runs simplicio's 6-layer contract under the hood. You see the diff + tests + verification — same as before, just dramatically more accurate.
364
+
365
+ ### Path 2 example — subscription shell-out, zero key
366
+
367
+ You already pay for Claude Pro/Max or ChatGPT Plus + Codex CLI. simplicio
368
+ piggybacks on that login — no extra bill, no key to manage.
369
+
370
+ ```bash
371
+ # Option A — Claude Code subscription (run `claude login` once)
372
+ export SIMPLICIO_MODEL=claude-cli/sonnet # or claude-cli/opus, claude-cli/default
373
+ unset SIMPLICIO_API_KEY # explicitly: no key needed
374
+
375
+ simplicio task "hide Delete button for non-admins" --stack angular \
376
+ --target src/app/screen/screen.component.html
377
+
378
+ # Option B — Codex CLI subscription (run `codex login` once)
379
+ export SIMPLICIO_MODEL=codex-cli/gpt-5 # or codex-cli/default
380
+ simplicio task "..." --stack angular --target ...
381
+ ```
382
+
383
+ How it works: simplicio shells out to `claude -p "<prompt>"` (or `codex exec "<prompt>"`) as a subprocess, captures stdout, runs the test loop. The inner CLI authenticates via your existing OAuth session in `~/.claude/` or `~/.codex/`. simplicio sets `SIMPLICIO_HOOK_GUARD=1` in the subprocess env so the inner Claude Code session does **not** re-fire simplicio's own UserPromptSubmit hook (no infinite recursion).
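The shell-out described above can be sketched as follows. This is an illustration under stated assumptions rather than the code in `simplicio/providers.py`: `build_shell_out` is a hypothetical helper, it only builds the command and environment instead of running them, and it does not forward the model name because the README does not show which flag the inner CLI expects.

```python
import os


def build_shell_out(model_spec, prompt, base_env=None):
    """Build (argv, env) for a 'claude-cli/<m>' or 'codex-cli/<m>' model spec."""
    provider, _, _model = model_spec.partition("/")
    if provider == "claude-cli":
        argv = ["claude", "-p", prompt]
    elif provider == "codex-cli":
        argv = ["codex", "exec", prompt]
    else:
        raise ValueError(f"not a shell-out model spec: {model_spec!r}")
    # Copy the environment so the caller's own process is left untouched.
    env = dict(os.environ if base_env is None else base_env)
    # Recursion guard: tells the inner session not to re-fire simplicio's hook.
    env["SIMPLICIO_HOOK_GUARD"] = "1"
    return argv, env


argv, env = build_shell_out("claude-cli/sonnet", "hide Delete button", base_env={})
print(argv, env)
```

A caller could then pass the pair to `subprocess.run(argv, env=env, capture_output=True, text=True)` and read the model output from `stdout`.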
384
+
385
+ ### Path 3 example — standalone with API key
386
+
387
+ ```bash
388
+ export SIMPLICIO_API_KEY=sk-or-v1-… # OpenRouter key
389
+ export SIMPLICIO_MODEL=anthropic/claude-opus-4
390
+ export SIMPLICIO_BASE_URL=https://openrouter.ai/api/v1
391
+
392
+ simplicio index --stack angular # one-time, builds embedding cache
393
+ simplicio task "hide Delete button for non-admins" \
394
+ --stack angular \
395
+ --target src/app/screen/screen.component.html \
396
+ --criteria "- no admin perm: button absent from DOM
397
+ - with admin perm: button present" \
398
+ --constraints "- don't touch save flow
399
+ - build passes"
400
+ ```
401
+
402
+ Provider-agnostic — see [Configure](#configure--any-llm-nothing-hardcoded) for the full matrix.
403
+
404
+ ---
405
+
406
+ ### Path 1 deep-dive — auto-activation in Claude Code
407
+
408
+ `pip install` puts `simplicio` on your PATH. To make Claude Code
409
+ **automatically** route code-edit tasks through simplicio, a skill + hook
410
+ need to land in `~/.claude/`.
411
+
412
+ **Zero-step path (recommended).** The first time you run *any* `simplicio`
413
+ command after install, if Claude Code is present (`~/.claude/` exists) and
414
+ the hook is missing, simplicio installs both for you and prints one stderr
415
+ line. PEP 517 wheels can't execute code on `pip install`, so this is the
416
+ closest equivalent that works on every machine.
417
+
418
+ ```bash
419
+ pip install simplicio-cli
420
+ simplicio smoke # ← first call also installs skill + hook (idempotent)
421
+ # stderr: "simplicio: auto-activation installed in Claude Code …"
422
+ ```
423
+
424
+ Opt out before the first call:
425
+
426
+ ```bash
427
+ export SIMPLICIO_SKIP_AUTO_INIT=1
428
+ ```
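The auto-bootstrap rule above (Claude Code present, hook missing, no opt-out) can be written as one small predicate. This is a sketch of the decision only, assuming the hook path listed in this README; `should_auto_init` is a hypothetical name and the real check in `simplicio/init.py` may differ.

```python
import tempfile
from pathlib import Path

HOOK_RELATIVE_PATH = Path("hooks") / "simplicio-userpromptsubmit.sh"


def should_auto_init(claude_home, env):
    """Decide whether the first `simplicio` call should install skill + hook."""
    if env.get("SIMPLICIO_SKIP_AUTO_INIT") == "1":
        return False  # explicit opt-out
    home = Path(claude_home)
    if not home.is_dir():
        return False  # Claude Code is not present on this machine
    return not (home / HOOK_RELATIVE_PATH).exists()  # install only if missing


with tempfile.TemporaryDirectory() as tmp:
    claude_home = Path(tmp) / ".claude"
    missing_home = should_auto_init(claude_home, {})
    claude_home.mkdir()
    fresh = should_auto_init(claude_home, {})
    opted_out = should_auto_init(claude_home, {"SIMPLICIO_SKIP_AUTO_INIT": "1"})
    hook = claude_home / HOOK_RELATIVE_PATH
    hook.parent.mkdir(parents=True)
    hook.write_text("#!/bin/sh\n")
    already_installed = should_auto_init(claude_home, {})

print(missing_home, fresh, opted_out, already_installed)  # False True False False
```

Because the predicate turns false once the hook file exists, repeated calls do nothing, which is what makes the bootstrap idempotent.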
429
+
430
+ **Explicit path.** Same effect, no auto-magic:
431
+
432
+ ```bash
433
+ simplicio init # idempotent
434
+ simplicio init --dry-run # preview only
435
+ simplicio init --claude-home <path> # override target dir
436
+ ```
437
+
438
+ Either way, two files land in `~/.claude/`:
439
+
440
+ | File | Purpose |
441
+ |---|---|
442
+ | `~/.claude/skills/simplicio-cli/SKILL.md` | Skill the agent matches by description when your prompt looks like a code edit |
443
+ | `~/.claude/hooks/simplicio-userpromptsubmit.sh` + entry in `~/.claude/settings.json` | UserPromptSubmit hook that runs `simplicio detect` on every prompt and injects a hint when the heuristic catches a code-edit task the skill could miss |
444
+
445
+ A backup of your previous `settings.json` is written to `settings.json.bak`
446
+ before any merge.
447
+
448
+ ### How it works at runtime
449
+
450
+ After install, every prompt you type in Claude Code flows through two layers:
451
+
452
+ 1. **Skill layer (semantic).** Claude reads the SKILL.md description. When
453
+ your prompt looks like a programming task ("add X to Y.tsx", "fix the auth
454
+ bug in middleware.py"), Claude considers using `simplicio task` instead of
455
+ writing code directly.
456
+ 2. **Hook layer (deterministic).** Every prompt fires `simplicio detect` via
457
+ the UserPromptSubmit hook. The classifier scores the prompt (verbs + file
458
+ extensions + code nouns − read-only cues). Score ≥ 3 → it emits a
459
+ `[SIMPLICIO_PROMPT_HINT]` block on stderr. Claude sees the hint alongside
460
+ your prompt — a hard nudge toward `simplicio task <prompt> <repo>`.
461
+
462
+ The layers are complementary. Skill = "Claude *might* pick simplicio". Hook
463
+ = "Claude *sees* the hint regardless".
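The hook-layer scoring can be illustrated with a toy classifier. The word lists, the one-point weights and the hint wording below are all hypothetical. Only the shape of the rule (verbs + file extensions + code nouns, minus read-only cues, threshold 3) comes from the description above, and `simplicio/detect.py` may implement it differently.

```python
import re

# Hypothetical vocabularies; the real classifier's lists are not shown here.
EDIT_VERBS = {"add", "fix", "hide", "remove", "rename", "refactor", "implement", "update"}
CODE_NOUNS = {"button", "function", "method", "class", "component", "endpoint", "bug", "test"}
READ_ONLY_CUES = {"explain", "why", "what", "summarize", "describe"}
FILE_EXT = re.compile(r"\.(?:py|js|ts|tsx|html|php|css|json)\b")
THRESHOLD = 3


def score_prompt(prompt):
    """Score = verbs + file extensions + code nouns - read-only cues."""
    text = prompt.lower()
    words = set(re.findall(r"[a-z]+", text))
    score = len(words & EDIT_VERBS)
    score += len(FILE_EXT.findall(text))
    score += len(words & CODE_NOUNS)
    score -= len(words & READ_ONLY_CUES)
    return score


def hint_for(prompt):
    """Return a hint block when the score reaches the threshold, else None."""
    if score_prompt(prompt) >= THRESHOLD:
        return "[SIMPLICIO_PROMPT_HINT] this looks like a code edit"
    return None


edit = "hide the Delete button for non-admins in src/app/screen/screen.component.html"
print(score_prompt(edit), hint_for(edit) is not None)
print(score_prompt("explain why the sky is blue"), hint_for("explain why the sky is blue"))
```

A hook script would write the returned block to stderr, as described above, and stay silent when the function returns `None`.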
464
+
465
+ ### Why UserPromptSubmit and not PreToolUse
466
+
467
+ UserPromptSubmit fires **once, before Claude decides which tool to call** —
468
+ exactly when we want to steer. PreToolUse fires *after* the decision is made,
469
+ and again for every tool call in the turn, with no access to the original
470
+ user prompt. UserPromptSubmit is the right pre-hook for routing decisions.
471
+
472
+ ### Disable / re-enable
473
+
474
+ | Goal | How |
475
+ |---|---|
476
+ | Block the auto-bootstrap | `export SIMPLICIO_SKIP_AUTO_INIT=1` before the first `simplicio` call |
477
+ | Disable hook permanently | Delete `~/.claude/hooks/simplicio-userpromptsubmit.sh` and its entry in `~/.claude/settings.json` |
478
+ | Re-install / repair | `simplicio init` (idempotent — won't double-write) |
479
+ | Preview without writing | `simplicio init --dry-run` |
480
+ | Skill-only (no hook) | Copy `.skills/simplicio-cli/SKILL.md` to `~/.claude/skills/simplicio-cli/SKILL.md` manually, skip `simplicio init` |
481
+
482
+ ---
483
+
484
+ ## Configure — any LLM, nothing hardcoded
485
+
486
+ > Applies to **Path 3** (standalone CLI with an API key). Path 1 users can skip this entire
487
+ > section — Claude Code handles the LLM call with the model and key already
488
+ > tied to your subscription.
489
+
490
+ | Provider | SIMPLICIO_MODEL | SIMPLICIO_BASE_URL |
491
+ |---|---|---|
492
+ | OpenRouter | `anthropic/claude-opus-4` | `https://openrouter.ai/api/v1` |
493
+ | GLM (z.ai) | `glm-4.6` | `https://api.z.ai/api/paas/v4` |
494
+ | DeepSeek | `deepseek-chat` | `https://api.deepseek.com` |
495
+ | OpenAI | `gpt-4.1` | `https://api.openai.com/v1` |
496
+ | Local (Ollama) | `llama3` | `http://localhost:11434/v1` |
497
+ | Anthropic native | `claude-opus-4-7` | *(leave unset)* |
498
+
499
+ If `SIMPLICIO_BASE_URL` is unset and the key is `ANTHROPIC_API_KEY`, it uses the
500
+ native Anthropic SDK. Otherwise it uses an OpenAI-compatible client pointed at
501
+ your `base_url` — so **any** OpenAI-like provider works without code changes.
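
The selection rule above can be sketched as a pure function over the environment. This is an illustration only, not the code in `simplicio/providers.py`: `choose_backend` and the returned dictionary keys are hypothetical names, and "the key is `ANTHROPIC_API_KEY`" is read here as "that variable is set".

```python
def choose_backend(env: dict) -> dict:
    """Pick a client type from environment variables (illustrative sketch)."""
    base_url = env.get("SIMPLICIO_BASE_URL")
    # No base URL and an Anthropic key present: use the native Anthropic SDK.
    if not base_url and env.get("ANTHROPIC_API_KEY"):
        return {"backend": "anthropic", "base_url": None}
    # Every other combination goes through an OpenAI-compatible client.
    return {"backend": "openai-compatible", "base_url": base_url}


native = choose_backend({"ANTHROPIC_API_KEY": "sk-example"})
routed = choose_backend({
    "SIMPLICIO_API_KEY": "sk-example",
    "SIMPLICIO_BASE_URL": "https://openrouter.ai/api/v1",
})
```

In practice you would pass `os.environ` instead of a literal dictionary; taking the mapping as a parameter just keeps the rule easy to test.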
502
+
503
+ ```bash
504
+ simplicio smoke # prints provider config + one test call
505
+ ```
506
+
507
+ ### The pipeline (both paths)
508
+
509
+ Whichever entry point you use, each task runs through the same engine:
510
+
511
+ ```
512
+ precedent (from cache)
513
+ → skill match
514
+ → 6-layer prompt
515
+ → LLM generates diff + test + Playwright
516
+ → apply diff
517
+ → run SIMPLICIO_TEST_CMD
518
+ → pass? done : send the error back → fix → retry (up to 3x)
519
+ ```
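
A minimal sketch of the last three steps (apply, test, retry) may make the control flow concrete. It is not the code in `simplicio/pipeline.py`: `run_with_retries`, `generate`, and `apply_and_test` are hypothetical names, and "up to 3x" is interpreted here as three attempts in total.

```python
from typing import Callable, Optional, Tuple


def run_with_retries(
    generate: Callable[[Optional[str]], str],
    apply_and_test: Callable[[str], Tuple[bool, str]],
    max_attempts: int = 3,
) -> Tuple[bool, int]:
    """Generate a diff, test it, and feed the failure output back on retry.

    Returns (passed, attempts_used).
    """
    error: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        diff = generate(error)                 # error is None on the first call
        passed, output = apply_and_test(diff)  # e.g. apply diff, run the test command
        if passed:
            return True, attempt
        error = output                         # send the error back to the model
    return False, max_attempts
```

Passing the model call and the test runner in as callables keeps the loop independent of any particular provider or test command.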
520
+
521
+ The 6-layer contract is what moves pass-rate from 41% to 99% on frontier
522
+ models (see [the numbers](#why-it-works--the-numbers) above). The retry loop
523
+ is what catches the remaining edge cases — measured separately in the
524
+ [4-quadrant bench](#4-quadrant-bench--agent--simplicio-matrix).
525
+
526
+ ### Common questions
527
+
528
+ **"I have a Claude Pro subscription but no API key — does this work?"** Yes,
529
+ on Path 1. Install simplicio-cli, open Claude Code, type your task as normal.
530
+ Claude Code makes the LLM call with your subscription; simplicio shapes the
531
+ prompt. No key needed.
532
+
533
+ **"I want to run it in CI / a script / outside Claude Code."** Path 2. Get an
534
+ API key from any of the providers above (OpenRouter is the cheapest way to
535
+ try multiple models behind one key), set `SIMPLICIO_API_KEY` +
536
+ `SIMPLICIO_MODEL` + optional `SIMPLICIO_BASE_URL`, run `simplicio task ...`.
537
+
538
+ **"I have Codex CLI / ChatGPT Plus and don't want to pay for an API key."**
539
+ Not auto-wired yet. Workarounds: (a) get an OpenRouter key (~$2 covers
540
+ thousands of tasks at small-model rates), (b) wait for the shell-out provider
541
+ that pipes through `claude -p` / `codex exec` using your subscription —
542
+ tracked, not shipped.
543
+
544
+ **"Will Claude Code use simplicio for *every* prompt now?"** No. The skill
545
+ only triggers on prompts that look like code edits (the description is
546
+ specific). The hook fires `simplicio detect` on every prompt but only emits
547
+ a hint when the deterministic classifier scores ≥ 3 (verbs + file extensions
548
+ + code nouns − read-only cues). "What does this function do?" gets no
549
+ nudge. "Add a delete confirmation to UserList.tsx" does.
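
To show the shape of such a classifier, here is a toy version. The threshold of 3 comes from the text above; everything else (the word lists, the weights, and the names `score` and `should_hint`) is invented for illustration and is not what `simplicio detect` actually uses.

```python
import re

# Hypothetical vocabularies and weights, chosen only to illustrate the idea.
EDIT_VERBS = {"add", "fix", "refactor", "rename", "implement", "remove", "update"}
CODE_NOUNS = {"function", "component", "endpoint", "class", "confirmation", "button"}
READ_ONLY_CUES = {"what", "why", "explain", "describe"}
FILE_EXT = re.compile(r"\b[\w/-]+\.(?:py|tsx?|jsx?|go|rs|java|html|css)\b")
THRESHOLD = 3


def score(prompt: str) -> int:
    """Deterministic score: verbs + file extensions + code nouns - read-only cues."""
    words = set(re.findall(r"[a-z]+", prompt.lower()))
    total = 0
    if words & EDIT_VERBS:
        total += 2
    if FILE_EXT.search(prompt):
        total += 2
    if words & CODE_NOUNS:
        total += 1
    if words & READ_ONLY_CUES:
        total -= 2
    return total


def should_hint(prompt: str) -> bool:
    return score(prompt) >= THRESHOLD


question = should_hint("What does this function do?")
edit = should_hint("Add a delete confirmation to UserList.tsx")
```

With these made-up weights the question scores below the threshold and the edit request scores above it, matching the two examples in the answer.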
550
+
551
+ **"How do I turn it off?"** See [Disable / re-enable](#disable--re-enable)
552
+ above. Set `SIMPLICIO_SKIP_AUTO_INIT=1` before the first call to block the
553
+ auto-bootstrap, or delete the hook script and its entry in `~/.claude/settings.json`.
554
+
555
+ ---
556
+
557
+ ## Cache — why it doesn't re-map every time
558
+
559
+ Embeddings are keyed by **content hash**, stored in `.simplicio/`. Unchanged
560
+ code block → vector reused. Change one file → only that block re-embeds.
561
+
562
+ | Run | Blocks embedded | Time |
563
+ |---|---|---|
564
+ | 1st (cold cache) | 3 | ~baseline |
565
+ | 2nd (no change) | **0** | **~instant** |
566
+ | after editing 1 file | **1** | partial |
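
The three rows above can be reproduced with a small in-memory sketch of content-hash caching. It assumes SHA-256 as the hash and a stand-in `fake_embed` function; the class name `EmbeddingCache` is hypothetical, and the real cache persists to `.simplicio/` rather than a dictionary.

```python
import hashlib


class EmbeddingCache:
    """Reuse vectors for unchanged blocks, keyed by a hash of their content."""

    def __init__(self, embed):
        self._embed = embed
        self._vectors = {}

    def index(self, blocks):
        """Embed only blocks not seen before; return how many were embedded."""
        embedded = 0
        for text in blocks:
            key = hashlib.sha256(text.encode("utf-8")).hexdigest()
            if key not in self._vectors:
                self._vectors[key] = self._embed(text)
                embedded += 1
        return embedded


def fake_embed(text):
    # Stand-in for a real embedding call.
    return [float(len(text))]


cache = EmbeddingCache(fake_embed)
blocks = ["def a(): pass", "def b(): pass", "def c(): pass"]
first = cache.index(blocks)       # cold cache
second = cache.index(blocks)      # nothing changed
blocks[1] = "def b(): return 1"   # edit one block
third = cache.index(blocks)       # only the edited block is new
```

Because the key depends only on content, renaming or moving a file without changing a block would still hit the cache in this sketch.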
567
+
568
+ ---
569
+
570
+ ## Benchmark — reproduce in 30 seconds
571
+
572
+ ```bash
573
+ OPENROUTER_API_KEY=… \
574
+ BENCH_MODELS="deepseek/deepseek-v4-pro,qwen/qwen3.7-max,moonshotai/kimi-k2.6,openai/gpt-5.5,anthropic/claude-opus-4.7,google/gemini-3.5-flash" \
575
+ python3 bench/run_offline.py
576
+ ```
577
+
578
+ No project required, stdlib only, deterministic regex scoring — no LLM judges
579
+ the LLM. Each case runs twice on the **same** model: raw one-line objective vs
580
+ simplicio's 6-layer contract. Outputs scored on target-file mention, DIFF
581
+ block, TEST block, contract-state words. Full numbers in [`bench/results.md`](bench/results.md).
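
A regex-only scorer along these lines might look like the sketch below. The four checks mirror the criteria listed above, but the concrete patterns are guesses: the `DIFF` and `TEST` markers at the start of a line, the `state_words` parameter, and the function name `score_output` are all assumptions rather than what `bench/run_offline.py` does.

```python
import re


def score_output(text: str, target_file: str, state_words: list) -> tuple:
    """Score one model output on four deterministic checks (0 to 4 points)."""
    checks = {
        "target_file": target_file in text,
        "diff_block": re.search(r"^\s*DIFF\b", text, re.MULTILINE) is not None,
        "test_block": re.search(r"^\s*TEST\b", text, re.MULTILINE) is not None,
        "contract_state": any(
            re.search(rf"\b{re.escape(word)}\b", text) for word in state_words
        ),
    }
    return sum(checks.values()), checks


structured = (
    "DIFF\n--- a/src/UserList.tsx\n+++ b/src/UserList.tsx\n"
    "TEST\nit('confirms delete')\nSTATE: DONE\n"
)
points_structured, _ = score_output(structured, "UserList.tsx", ["DONE"])
points_chatty, _ = score_output("Sure, I can help with that.", "UserList.tsx", ["DONE"])
```

Since every check is a string or regex match, two runs over the same output always give the same score.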
582
+
583
+ ### Full harness (your real project, your real tests)
584
+
585
+ ```bash
586
+ simplicio bench --cases bench/cases.json --stack angular
587
+ ```
588
+
589
+ Runs each case with and without simplicio and runs **your real test command**
590
+ (e.g. `ng test --watch=false`) on each output. Writes the true pass-rate to
591
+ [`bench/results.md`](bench/results.md).
592
+
593
+ ### 4-quadrant bench — agent × simplicio matrix
594
+
595
+ Adds the second axis: not just *"does the 6-layer wrap help one call?"* but
596
+ *"does it still help inside a retry loop?"*. Same model, same cases — only
597
+ the cell logic changes.
598
+
599
+ | | **no simplicio** | **with simplicio** |
600
+ | ----------------------- | ------------------------ | ------------------------ |
601
+ | **no agent** (1 call) | Q1 — baseline | Q2 — current bench |
602
+ | **with agent** (loop) | Q3 — loop only | Q4 — composition |
603
+
604
+ ```bash
605
+ pip install -e ".[bench]" # adds fpdf2 for PDF report
606
+ OPENROUTER_API_KEY=… \
607
+ BENCH_MODELS="google/gemma-3-4b-it" \
608
+ BENCH_MAX_ITERS=3 \
609
+ python3 bench/run_4quadrant.py
610
+ ```
611
+
612
+ Outputs `bench/results_4quadrant.{md,pdf,json}` + SVG charts under
613
+ `bench/charts/4q_*.svg` + per-iteration raw outputs under
614
+ `.simplicio/bench_4q/<model>/case_NN/q*_iter*.txt`. Methodology and
615
+ hypothesis decomposition: [`docs/benchmark-4quadrant.md`](docs/benchmark-4quadrant.md).
616
+
617
+ The matrix decomposes:
618
+
619
+ - **Prompt effect alone**: Q2 − Q1
620
+ - **Loop effect alone**: Q3 − Q1
621
+ - **Prompt effect inside loop**: Q4 − Q3 (does simplicio still matter once you loop?)
622
+ - **Composition gain over best single axis**: Q4 − max(Q2, Q3)
623
+ - **Synergy vs linear stacking**: Q4 − (Q1 + (Q2−Q1) + (Q3−Q1))
624
+
625
+ #### Run 1 — focused single-model, `google/gemma-3-4b-it`, 5 cases, max_iters=3 (2026-05-26)
626
+
627
+ | Quadrant | Prompt | Execution | Pass rate | Avg iters | Tokens / pass |
628
+ |---|---|---|---|---|---|
629
+ | **Q1** | raw goal | 1-shot | **0/5 (0%)** | 1.00 | 4,683 |
630
+ | **Q2** | simplicio 6-layer | 1-shot | **3/5 (60%)** | 1.00 | 800 |
631
+ | **Q3** | raw goal | loop w/ feedback | **2/5 (40%)** | 3.00 | 3,135 |
632
+ | **Q4** | simplicio 6-layer | loop w/ feedback | **4/5 (80%)** | 1.80 | 1,018 |
633
+
634
+ Decomposition (rejection threshold `|Δ| ≥ 5 pts`):
635
+
636
+ | Hypothesis | Δ | Verdict |
637
+ |---|---|---|
638
+ | Loop alone closes the gap (simplicio unnecessary once you loop) | Q4 − Q3 = **+40 pts** | **rejected** |
639
+ | Simplicio alone is enough (loop is overkill) | Q4 − Q2 = **+20 pts** | **rejected** |
640
+ | Gains stack linearly (no synergy) | Q4 − linear = **−20 pts** | **rejected** |
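
The Δ column can be recomputed from the Run 1 pass rates with the formulas listed under "The matrix decomposes". The helper name `decompose` and its dictionary keys are hypothetical; the inputs are the percentages from the Run 1 table.

```python
def decompose(q1: int, q2: int, q3: int, q4: int) -> dict:
    """Apply the five decomposition formulas to pass rates in percentage points."""
    return {
        "prompt_alone": q2 - q1,
        "loop_alone": q3 - q1,
        "prompt_inside_loop": q4 - q3,
        "gain_over_best_single_axis": q4 - max(q2, q3),
        "synergy_vs_linear": q4 - (q1 + (q2 - q1) + (q3 - q1)),
    }


# Run 1 pass rates: Q1 = 0%, Q2 = 60%, Q3 = 40%, Q4 = 80%.
run1 = decompose(q1=0, q2=60, q3=40, q4=80)
```

Here `prompt_inside_loop` is the +40 in the first row, `gain_over_best_single_axis` equals Q4 − Q2 = +20 because Q2 is the better single axis, and `synergy_vs_linear` is the −20 in the last row, meaning the two gains overlap rather than add up.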
641
+
642
+ Cost per passing case: Q1 = 4,683 tok / 236s — Q2 = **800 tok / 21s** — Q3 = 3,135 tok / 109s — Q4 = **1,018 tok / 20s**. Full table + charts in [`bench/results_4quadrant.md`](bench/results_4quadrant.md).
643
+
644
+ #### Run 2 — wider multi-model, 3 models × 10 cases (partial), max_iters=5 (2026-05-26)
645
+
646
+ Replicated the matrix across more models and more cases. `qwen-2.5-7b` covers only the first 5 of 10 cases (the wide run was killed mid-execution), and `claude-3.5-haiku` was never reached. The aggregate counts every observed `(model × case × quadrant)` tuple as one observation:
647
+
648
+ | Quadrant | Prompt | Execution | Pass rate | Avg iters | Tokens / pass | ms / pass |
649
+ |---|---|---|---|---|---|---|
650
+ | **Q1** | raw goal | 1-shot | **0/25 (0%)** | 1.00 | 22,387 | 817,437 |
651
+ | **Q2** | simplicio 6-layer | 1-shot | **16/25 (64%)** | 1.00 | 1,093 | 14,797 |
652
+ | **Q3** | raw goal | loop w/ feedback | **11/25 (44%)** | 4.00 | 7,154 | 106,382 |
653
+ | **Q4** | simplicio 6-layer | loop w/ feedback | **19/25 (76%)** | 2.44 | 1,914 | 24,170 |
654
+
655
+ Per-model breakdown:
656
+
657
+ | Model | Cases | Q1 | Q2 | Q3 | Q4 |
658
+ |---|---|---|---|---|---|
659
+ | `google/gemma-3-4b-it` | 10/10 | 0/10 (0%) | 7/10 (70%) | 4/10 (40%) | **8/10 (80%)** |
660
+ | `meta-llama/llama-3.2-3b-instruct` | 10/10 | 0/10 (0%) | 5/10 (50%) | 4/10 (40%) | **6/10 (60%)** |
661
+ | `qwen/qwen-2.5-7b-instruct` | 5/10 | 0/5 (0%) | 4/5 (80%) | 3/5 (60%) | **5/5 (100%)** |
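
The aggregate row is a pooled count over these per-model rows, not an average of the per-model percentages. A short sketch (the `pool` helper is a hypothetical name; the numbers are copied from the table above) shows how the 25 observations per quadrant arise:

```python
per_model = {
    "google/gemma-3-4b-it": {"cases": 10, "Q1": 0, "Q2": 7, "Q3": 4, "Q4": 8},
    "meta-llama/llama-3.2-3b-instruct": {"cases": 10, "Q1": 0, "Q2": 5, "Q3": 4, "Q4": 6},
    "qwen/qwen-2.5-7b-instruct": {"cases": 5, "Q1": 0, "Q2": 4, "Q3": 3, "Q4": 5},
}


def pool(rows: dict) -> dict:
    """Sum passes and cases across models; return (passes, total, percent) per quadrant."""
    total = sum(row["cases"] for row in rows.values())
    pooled = {}
    for quadrant in ("Q1", "Q2", "Q3", "Q4"):
        passes = sum(row[quadrant] for row in rows.values())
        pooled[quadrant] = (passes, total, round(100 * passes / total))
    return pooled


aggregate = pool(per_model)
```

Pooling means the partially run `qwen` model contributes 5 observations per quadrant instead of 10, so it carries half the weight of the other two models.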
662
+
663
+ Decomposition (rejection threshold `|Δ| ≥ 5 pts`):
664
+
665
+ | Hypothesis | Δ | Verdict |
666
+ |---|---|---|
667
+ | Loop alone closes the gap (simplicio unnecessary once you loop) | Q4 − Q3 = **+32 pts** | **rejected** |
668
+ | Simplicio alone is enough (loop is overkill) | Q4 − Q2 = **+12 pts** | **rejected** |
669
+ | Gains stack linearly (no synergy) | Q4 − linear = **−32 pts** | **rejected** |
670
+
671
+ Same picture in both runs: Q4 (composition) wins on pass-rate, **and** Q4 stays close to Q2 on cost (1.9k tok / 24s per pass vs. Q2's 1.1k / 15s), while Q3 burns 7.2k tok / 106s per pass for fewer passes. Full table + per-case breakdown in [`bench/results_4quadrant_wide.md`](bench/results_4quadrant_wide.md).
672
+
673
+ ---
674
+
675
+ ## Plug points (stubs marked in code)
676
+
677
+ | File | Replace with |
678
+ |---|---|
679
+ | `prompt.py::_mapper` | your real **llm-project-mapper** |
680
+ | `pipeline.py::_aplicar_e_testar` | extract diff → `git apply` → parse test result |
681
+ | `skill_router.py` | point `SIMPLICIO_SKILLS_DIR` at your mapper's skills |
682
+
683
+ ## Layout
684
+
685
+ ```
686
+ simplicio/
687
+ cli.py # index | task | bench | smoke | init | detect
688
+ cache.py # content-hash embedding cache
689
+ precedent.py # grep + semantic rank (uses cache)
690
+ skill_router.py # picks the ONE matching skill
691
+ prompt.py # stacks the 6 layers
692
+ providers.py # any OpenAI-compatible endpoint + Anthropic native
693
+ pipeline.py # generate → test → fix loop
694
+ bench.py # with-vs-without harness
695
+ templates/simplicio_prompt.md
696
+ bench/
697
+ run_offline.py # stdlib-only multi-model benchmark
698
+ cases.json # your benchmark tasks
699
+ cases_offline.json
700
+ results.md # filled by `simplicio bench` / `run_offline.py`
701
+ charts/ # SVG: overall, delta, by_case, by_stack
702
+ ```
703
+
704
+ ## License
705
+ MIT