simplicio-cli 0.2.2__tar.gz → 0.2.9__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (31)
  1. simplicio_cli-0.2.9/PKG-INFO +355 -0
  2. simplicio_cli-0.2.9/README.md +318 -0
  3. {simplicio_cli-0.2.2 → simplicio_cli-0.2.9}/pyproject.toml +4 -1
  4. simplicio_cli-0.2.9/simplicio/bench.py +51 -0
  5. {simplicio_cli-0.2.2 → simplicio_cli-0.2.9}/simplicio/cache.py +20 -20
  6. simplicio_cli-0.2.9/simplicio/cli.py +43 -0
  7. simplicio_cli-0.2.9/simplicio/pipeline.py +28 -0
  8. {simplicio_cli-0.2.2 → simplicio_cli-0.2.9}/simplicio/precedent.py +24 -24
  9. simplicio_cli-0.2.9/simplicio/prompt.py +25 -0
  10. simplicio_cli-0.2.9/simplicio/providers.py +63 -0
  11. simplicio_cli-0.2.9/simplicio/skill_router.py +48 -0
  12. simplicio_cli-0.2.9/simplicio/templates/simplicio_prompt.md +43 -0
  13. simplicio_cli-0.2.9/simplicio_cli.egg-info/PKG-INFO +355 -0
  14. {simplicio_cli-0.2.2 → simplicio_cli-0.2.9}/simplicio_cli.egg-info/requires.txt +3 -0
  15. simplicio_cli-0.2.2/PKG-INFO +0 -231
  16. simplicio_cli-0.2.2/README.md +0 -196
  17. simplicio_cli-0.2.2/simplicio/bench.py +0 -51
  18. simplicio_cli-0.2.2/simplicio/cli.py +0 -43
  19. simplicio_cli-0.2.2/simplicio/pipeline.py +0 -28
  20. simplicio_cli-0.2.2/simplicio/prompt.py +0 -25
  21. simplicio_cli-0.2.2/simplicio/providers.py +0 -62
  22. simplicio_cli-0.2.2/simplicio/skill_router.py +0 -48
  23. simplicio_cli-0.2.2/simplicio/templates/simplicio_prompt.md +0 -43
  24. simplicio_cli-0.2.2/simplicio_cli.egg-info/PKG-INFO +0 -231
  25. {simplicio_cli-0.2.2 → simplicio_cli-0.2.9}/LICENSE +0 -0
  26. {simplicio_cli-0.2.2 → simplicio_cli-0.2.9}/setup.cfg +0 -0
  27. {simplicio_cli-0.2.2 → simplicio_cli-0.2.9}/simplicio/__init__.py +0 -0
  28. {simplicio_cli-0.2.2 → simplicio_cli-0.2.9}/simplicio_cli.egg-info/SOURCES.txt +0 -0
  29. {simplicio_cli-0.2.2 → simplicio_cli-0.2.9}/simplicio_cli.egg-info/dependency_links.txt +0 -0
  30. {simplicio_cli-0.2.2 → simplicio_cli-0.2.9}/simplicio_cli.egg-info/entry_points.txt +0 -0
  31. {simplicio_cli-0.2.2 → simplicio_cli-0.2.9}/simplicio_cli.egg-info/top_level.txt +0 -0
@@ -0,0 +1,355 @@
1
+ Metadata-Version: 2.4
2
+ Name: simplicio-cli
3
+ Version: 0.2.9
4
+ Summary: Portable task-to-code pipeline that works with any LLM. Turn a one-line task into a verified code change — diff + test + verify loop. +55 pts on a 156-check benchmark, 21% faster, ~same tokens.
5
+ Author-email: Wesley Simplicio <wesleybob4@gmail.com>
6
+ License: MIT
7
+ Project-URL: Homepage, https://github.com/wesleysimplicio/simplicio-cli
8
+ Project-URL: Repository, https://github.com/wesleysimplicio/simplicio-cli
9
+ Project-URL: Issues, https://github.com/wesleysimplicio/simplicio-cli/issues
10
+ Project-URL: Changelog, https://github.com/wesleysimplicio/simplicio-cli/releases
11
+ Keywords: llm,ai,agent,code-generation,prompt-engineering,openrouter,openai,anthropic,claude,developer-tools,cli,rag,embeddings,verify-loop,task-automation
12
+ Classifier: Development Status :: 4 - Beta
13
+ Classifier: Environment :: Console
14
+ Classifier: Intended Audience :: Developers
15
+ Classifier: License :: OSI Approved :: MIT License
16
+ Classifier: Operating System :: OS Independent
17
+ Classifier: Programming Language :: Python :: 3
18
+ Classifier: Programming Language :: Python :: 3 :: Only
19
+ Classifier: Programming Language :: Python :: 3.9
20
+ Classifier: Programming Language :: Python :: 3.10
21
+ Classifier: Programming Language :: Python :: 3.11
22
+ Classifier: Programming Language :: Python :: 3.12
23
+ Classifier: Topic :: Software Development
24
+ Classifier: Topic :: Software Development :: Code Generators
25
+ Classifier: Topic :: Software Development :: Libraries :: Python Modules
26
+ Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
27
+ Requires-Python: >=3.9
28
+ Description-Content-Type: text/markdown
29
+ License-File: LICENSE
30
+ Requires-Dist: sentence-transformers>=2.2
31
+ Requires-Dist: numpy>=1.23
32
+ Requires-Dist: anthropic>=0.30
33
+ Requires-Dist: openai>=1.30
34
+ Provides-Extra: bench
35
+ Requires-Dist: fpdf2>=2.7; extra == "bench"
36
+ Dynamic: license-file
37
+
38
+ # simplicio-cli
39
+
40
+ **Your tasks with 99% accuracy using any LLM (Claude, DeepSeek, Codex, Gemini, Hermes, OpenClaw, Cursor).**
41
+
42
+ [![PyPI](https://img.shields.io/pypi/v/simplicio-cli.svg)](https://pypi.org/project/simplicio-cli/)
43
+ [![Python](https://img.shields.io/pypi/pyversions/simplicio-cli.svg)](https://pypi.org/project/simplicio-cli/)
44
+ [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)
45
+
46
+ [![simplicio-cli pipeline hero: one-line task to verified code change](https://raw.githubusercontent.com/wesleysimplicio/simplicio-cli/master/output/imagegen/simplicio-cli-readme-hero-web.png)](output/imagegen/simplicio-cli-readme-hero.png)
47
+
48
+ > *"hide the Delete button for non-admins"* → diff + test + applied + verified.
49
+ > Works with **OpenRouter, OpenAI, Anthropic, GLM, DeepSeek, Ollama** — one env var.
50
+
51
+ ```bash
52
+ pip install simplicio-cli
53
+ ```
54
+
55
+ ---
56
+
57
+ ## Why it works — the numbers
58
+
59
+ Same model. Same task. Only the prompt changes. **Measured, reproducible, deterministic.**
60
+ **Fourteen models tested across three runs** — five sub-4B tiny models, six
61
+ frontier 2026 models, and three mid-tier 7B–12B open models. Every one gained
62
+ at least **+14 points** when wrapped in simplicio's 6-layer contract.
63
+
64
+ #### Tiny models — sub-4B, run on 2026-05-26 (50 runs/side, 260 checks)
65
+
66
+ | Model | Without simplicio | With simplicio | Gain |
67
+ |---|---|---|---|
68
+ | **Gemma 3 4B** (`google/gemma-3-4b-it`) | 38% | **96%** | **+58 pts** |
69
+ | **Llama 3.2 3B** (`meta-llama/llama-3.2-3b-instruct`) | 28% | **73%** | **+45 pts** |
70
+ | **Gemma 3n e4B** (`google/gemma-3n-e4b-it`) | 44% | **88%** | **+44 pts** |
71
+ | **Phi-4 mini** (`microsoft/phi-4-mini-instruct`) | 36% | **73%** | **+37 pts** |
72
+ | **Llama 3.2 1B** (`meta-llama/llama-3.2-1b-instruct`) | 26% | **40%** | **+14 pts** |
73
+ | **Tiny avg (5 models · 10 cases · 260 checks)** | **35%** | **74%** | **+39 pts (+112%)** |
74
+
75
+ > **Not hosted on OpenRouter** (requested but skipped): Gemma 3 270M, Gemma 3 1B,
76
+ > Gemma 2 2B, Qwen3 0.6B, Qwen3 1.7B, Qwen2.5 0.5B, Qwen2.5 1.5B, Qwen 3B,
77
+ > Nemotron Nano 4B (OpenRouter's smallest Nemotron is 9B). Sub-4B substitutes were used above.
78
+ > simplicio still gains **+14 to +58 points** across them, including **+14 points** on a 1B-param model.
79
+
80
+ #### Frontier 2026 models — run on 2026-05-26 (60 runs/side, 312 checks)
81
+
82
+ | Model | Without simplicio | With simplicio | Gain |
83
+ |---|---|---|---|
84
+ | **GPT-5.5** (`openai/gpt-5.5`) | 38% | **100%** | **+62 pts** |
85
+ | **Kimi K2.6** (`moonshotai/kimi-k2.6`) | 40% | **100%** | **+60 pts** |
86
+ | **Gemini 3.5 Flash** (`google/gemini-3.5-flash`) | 42% | **100%** | **+58 pts** |
87
+ | **Qwen 3.7 Max** (`qwen/qwen3.7-max`) | 44% | **100%** | **+56 pts** |
88
+ | **Claude Opus 4.7** (`anthropic/claude-opus-4.7`) | 42% | **98%** | **+56 pts** |
89
+ | **DeepSeek V4 Pro** (`deepseek/deepseek-v4-pro`) | 44% | **96%** | **+52 pts** |
90
+ | **Frontier avg (6 models · 10 cases · 312 checks)** | **41%** | **99%** | **+58 pts (+136%)** |
91
+
92
+ #### Mid-tier 7B–12B open models — earlier run (v0.2.2, 30 runs/side, 156 checks)
93
+
94
+ | Model | Without simplicio | With simplicio | Gain |
95
+ |---|---|---|---|
96
+ | **Gemma 3 12B** (`google/gemma-3-12b-it`) | 34% | **92%** | **+58 pts** |
97
+ | **Llama 3.1 8B** (`meta-llama/llama-3.1-8b-instruct`) | 36% | **90%** | **+54 pts** |
98
+ | **Qwen 2.5 7B** (`qwen/qwen-2.5-7b-instruct`) | 34% | **88%** | **+54 pts** |
99
+ | **Mid-tier avg (3 models · 10 cases · 156 checks)** | **35%** | **90%** | **+55 pts (+156%)** |
100
+
101
+ > **Across all 14 models tested across three runs**, the average gain is **+51
102
+ > points**. Smallest: **+14 pts** (Llama 3.2 1B — the contract still moves a
103
+ > 1B-param model). Largest: **+62 pts** (GPT-5.5). The contract helps tiny
104
+ > sub-4B models, frontier reasoning models, and mid-tier 7B–12B alike: four
104
+ > of the six frontier models hit **100% pass-rate**.
106
+
107
+ ### Output-quality signals (rate across all 60 frontier runs)
108
+
109
+ | Signal | Raw prompt | With simplicio |
110
+ |---|---|---|
111
+ | **DIFF block present** | 36% | **98%** |
112
+ | Target file mentioned | 1% | **100%** |
113
+ | TEST block present | 88% | **98%** |
114
+
115
+ ### Cost — tokens & wall-clock (measured, not estimated)
116
+
117
+ Same provider, same models, same cases. Token counts pulled from the API
118
+ `usage` field; latency from `time.perf_counter()` around each call.
119
+
120
+ | Side | Tokens / run | Wall-clock / run | Total tokens (60 runs) | Total time |
121
+ |---|---|---|---|---|
122
+ | Raw prompt | 1,967 | 46.1s | 118,040 | 46m 07s |
123
+ | With simplicio | **3,168** | **57.6s** | **190,119** | **57m 33s** |
124
+ | Δ | **+61%** | **+24%** | +72,079 | +11m 26s |
125
+
126
+ simplicio wraps the objective in a 6-layer contract — more input tokens up
127
+ front, longer completions because the model produces the full DIFF + TEST +
128
+ EVIDENCE the contract demands instead of a one-line guess. The bill goes up,
129
+ but so does the **pass-rate (41% → 99%)** and the **DIFF-block rate (36% → 98%)** —
130
+ useful tokens, not chat.
131
+
132
+ > Six frontier models — GPT-5.5, Kimi K2.6, Gemini 3.5 Flash, Qwen 3.7 Max,
133
+ > Claude Opus 4.7, DeepSeek V4 Pro — gained **+52 to +62 points** when wrapped
134
+ > in simplicio's 6-layer contract. Without changing the model. Without
135
+ > fine-tuning. Four of six landed at **100% pass-rate with simplicio**.
136
+
137
+ Full report: [`bench/results.md`](bench/results.md) · [`bench/results.pdf`](bench/results.pdf) · raw outputs under `.simplicio/bench_runs/`.
138
+
139
+ ---
140
+
141
+ ## How it works
142
+
143
+ ```
144
+ mapper WHERE project structure + latest state
145
+ precedent HOW-1 the real snippet in THIS repo that already does it
146
+ skill-router HOW-2 the ONE mapper skill that matches (ranked, not all)
147
+ simplicio BUILD stacks the 6 layers into one prompt (cache-friendly)
148
+ test JUDGE contract written as testable states
149
+ verify PROOF ran it — did it actually pass? loop-fix up to 3x
150
+ ```
151
+
152
+ **The idea in one line: don't ask the model to guess — hand it the path.**
153
+ Each layer terminates one decision the model would otherwise hallucinate.
154
+ Relevant > complete — inject the *right* context, never *all* of it.
155
+
156
+ ---
157
+
158
+ ## Install
159
+
160
+ ```bash
161
+ pip install simplicio-cli # from PyPI
162
+ # or
163
+ pip install -e . # from this repo
164
+ ```
165
+
166
+ ## Configure — any LLM, nothing hardcoded
167
+
168
+ | Provider | SIMPLICIO_MODEL | SIMPLICIO_BASE_URL |
169
+ |---|---|---|
170
+ | OpenRouter | `anthropic/claude-opus-4` | `https://openrouter.ai/api/v1` |
171
+ | GLM (z.ai) | `glm-4.6` | `https://api.z.ai/api/paas/v4` |
172
+ | DeepSeek | `deepseek-chat` | `https://api.deepseek.com` |
173
+ | OpenAI | `gpt-4.1` | `https://api.openai.com/v1` |
174
+ | Local (Ollama) | `llama3` | `http://localhost:11434/v1` |
175
+ | Anthropic native | `claude-opus-4-7` | *(leave unset)* |
176
+
177
+ If `SIMPLICIO_BASE_URL` is unset and the key is `ANTHROPIC_API_KEY`, it uses the
178
+ native Anthropic SDK. Otherwise it uses an OpenAI-compatible client pointed at
179
+ your `base_url` — so **any** OpenAI-like provider works without code changes.
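
The selection rule described above can be sketched in a few lines. This is a minimal illustration, not the code in `providers.py`: the function name `pick_client` and the returned labels are hypothetical, and treating an empty `SIMPLICIO_BASE_URL` the same as an unset one is an assumption.

```python
from typing import Mapping


def pick_client(env: Mapping[str, str]) -> str:
    """Return which client family the stated rule would select.

    Hypothetical helper: the labels are illustrative, not names
    taken from the package.
    """
    # Assumption: an empty value counts as "unset".
    base_url = env.get("SIMPLICIO_BASE_URL", "").strip()
    if not base_url and env.get("ANTHROPIC_API_KEY"):
        # No base URL and an Anthropic key: native Anthropic SDK.
        return "anthropic-native"
    # Everything else goes through an OpenAI-compatible client.
    return "openai-compatible"


native = pick_client({"ANTHROPIC_API_KEY": "sk-ant-example"})
ollama = pick_client({"SIMPLICIO_BASE_URL": "http://localhost:11434/v1"})
both = pick_client({
    "ANTHROPIC_API_KEY": "sk-ant-example",
    "SIMPLICIO_BASE_URL": "https://openrouter.ai/api/v1",
})
```

Passing the environment as a mapping (for example `os.environ`) keeps the rule easy to test without touching the real process environment.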
180
+
181
+ ```bash
182
+ simplicio smoke # prints provider config + one test call
183
+ ```
184
+
185
+ ## Use
186
+
187
+ ```bash
188
+ # index once (caches embeddings; re-run after big changes)
189
+ simplicio index --stack angular
190
+
191
+ # run a task
192
+ simplicio task "hide Delete button for non-admins" \
193
+ --stack angular \
194
+ --target src/app/screen/screen.component.html \
195
+ --criteria "- no admin perm: button absent from DOM
196
+ - with admin perm: button present" \
197
+ --constraints "- don't touch save flow
198
+ - build passes"
199
+ ```
200
+
201
+ Each `task`: precedent (from cache) → skill match → 6 layers → LLM generates
202
+ (diff + test + Playwright) → apply → run `SIMPLICIO_TEST_CMD` → pass? **done** :
203
+ send the error back → fix → retry (up to 3x).
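
The generate, apply, test, retry cycle above can be sketched as a small loop. This is an illustration under stated assumptions rather than the contents of `pipeline.py`: `verify_loop`, `shell_test`, and the stub callables are hypothetical names, and "up to 3x" is read here as three attempts in total.

```python
import subprocess
from typing import Callable, List, Optional, Tuple


def shell_test(cmd: str) -> Tuple[bool, str]:
    """Run a test command (e.g. the value of SIMPLICIO_TEST_CMD) and capture its output."""
    proc = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    return proc.returncode == 0, proc.stdout + proc.stderr


def verify_loop(
    generate: Callable[[Optional[str]], str],
    apply: Callable[[str], None],
    test: Callable[[], Tuple[bool, str]],
    max_attempts: int = 3,  # assumption: three attempts in total
) -> Tuple[bool, int]:
    """Generate a change, apply it, run the tests, and feed failures back."""
    feedback: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        change = generate(feedback)   # first call gets no feedback
        apply(change)
        passed, output = test()
        if passed:
            return True, attempt
        feedback = output             # send the error back for the next try
    return False, max_attempts


# Stubs standing in for the LLM call, the patch step, and the test run.
applied: List[str] = []


def fake_generate(feedback: Optional[str]) -> str:
    return "fixed diff" if feedback else "broken diff"


def fake_test() -> Tuple[bool, str]:
    ok = applied[-1] == "fixed diff"
    return ok, ("" if ok else "AssertionError: button still in DOM")


result = verify_loop(fake_generate, applied.append, fake_test)
```

Injecting the three steps as callables keeps the loop independent of any particular provider or test runner; in real use, `test` could be `lambda: shell_test(os.environ["SIMPLICIO_TEST_CMD"])`.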
204
+
205
+ ---
206
+
207
+ ## Cache — why it doesn't re-map every time
208
+
209
+ Embeddings are keyed by **content hash**, stored in `.simplicio/`. Unchanged
210
+ code block → vector reused. Change one file → only that block re-embeds.
211
+
212
+ | Run | Blocks embedded | Time |
213
+ |---|---|---|
214
+ | 1st (cold cache) | 3 | ~baseline |
215
+ | 2nd (no change) | **0** | **~instant** |
216
+ | after editing 1 file | **1** | partial |
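
A minimal sketch of a content-hash cache that reproduces the 3 / 0 / 1 pattern in the table above. The class name `ContentHashCache`, the use of SHA-256, and the `fake_embed` stand-in are assumptions for illustration; the sketch keeps vectors in memory, whereas the README states they are stored under `.simplicio/`.

```python
import hashlib
from typing import Callable, Dict, List, Sequence

Vector = List[float]


class ContentHashCache:
    """Embeds a block only when its content hash has not been seen before."""

    def __init__(self, embed: Callable[[str], Vector]) -> None:
        self._embed = embed
        self._vectors: Dict[str, Vector] = {}

    @staticmethod
    def key(block: str) -> str:
        # Assumption: SHA-256 of the UTF-8 text serves as the content hash.
        return hashlib.sha256(block.encode("utf-8")).hexdigest()

    def embed_all(self, blocks: Sequence[str]) -> int:
        """Embed uncached blocks and return how many were embedded."""
        embedded = 0
        for block in blocks:
            k = self.key(block)
            if k not in self._vectors:
                self._vectors[k] = self._embed(block)
                embedded += 1
        return embedded


def fake_embed(text: str) -> Vector:
    # Stand-in for a real embedding model.
    return [float(len(text))]


cache = ContentHashCache(fake_embed)
blocks = ["def a(): pass", "def b(): pass", "def c(): pass"]
cold = cache.embed_all(blocks)         # cold cache: every block is embedded
warm = cache.embed_all(blocks)         # nothing changed: nothing is embedded
blocks[1] = "def b(): return 1"        # edit one block
after_edit = cache.embed_all(blocks)   # only the edited block is embedded
```

Keying on content rather than file path means a renamed but unchanged block would also hit the cache in this sketch.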
217
+
218
+ ---
219
+
220
+ ## Benchmark — reproduce in 30 seconds
221
+
222
+ ```bash
223
+ OPENROUTER_API_KEY=… \
224
+ BENCH_MODELS="deepseek/deepseek-v4-pro,qwen/qwen3.7-max,moonshotai/kimi-k2.6,openai/gpt-5.5,anthropic/claude-opus-4.7,google/gemini-3.5-flash" \
225
+ python3 bench/run_offline.py
226
+ ```
227
+
228
+ No project required, stdlib only, deterministic regex scoring — no LLM judges
229
+ the LLM. Each case runs twice on the **same** model: raw one-line objective vs
230
+ simplicio's 6-layer contract. Outputs scored on target-file mention, DIFF
231
+ block, TEST block, contract-state words. Full numbers in [`bench/results.md`](bench/results.md).
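
To make "deterministic regex scoring" concrete, here is a rough sketch of the four checks named above. The patterns, the function name `score_output`, and the sample texts are all assumptions; the actual rules in `bench/run_offline.py` are not shown in this diff and may differ.

```python
import re
from typing import Dict, Sequence


def score_output(text: str, target: str, state_words: Sequence[str]) -> Dict[str, bool]:
    """Score one model output on four yes/no checks (hypothetical patterns)."""
    lowered = text.lower()
    return {
        # Target file path appears anywhere in the output.
        "target_file": target.lower() in lowered,
        # Assumption: a DIFF block is detected by unified-diff file headers.
        "diff_block": bool(re.search(r"^--- \S+\n\+\+\+ \S+", text, re.MULTILINE)),
        # Assumption: a TEST block starts with a line beginning with "TEST".
        "test_block": bool(re.search(r"^TEST\b", text, re.MULTILINE)),
        # Every contract-state word is mentioned.
        "state_words": all(w.lower() in lowered for w in state_words),
    }


target = "src/app/screen/screen.component.html"
contract_style = "\n".join([
    "DIFF",
    "--- a/" + target,
    "+++ b/" + target,
    "@@ -1 +1 @@",
    "-<button>Delete</button>",
    '+<button *ngIf="isAdmin">Delete</button>',
    "TEST",
    "no admin perm: button absent from DOM; with admin perm: button present",
])
raw_style = "Just add an ngIf to the button."

with_checks = sum(score_output(contract_style, target, ["absent", "present"]).values())
raw_checks = sum(score_output(raw_style, target, ["absent", "present"]).values())
```

Because every check is a plain string or regex test, the same output always receives the same score, which is the property the README relies on when it says no LLM judges the LLM.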
232
+
233
+ ### Full harness (your real project, your real tests)
234
+
235
+ ```bash
236
+ simplicio bench --cases bench/cases.json --stack angular
237
+ ```
238
+
239
+ Runs each case two ways and runs **your real test command** (e.g. `ng test
240
+ --watch=false`) on each output. Writes the true pass-rate to
241
+ [`bench/results.md`](bench/results.md).
242
+
243
+ ### 4-quadrant bench — agent × simplicio matrix
244
+
245
+ Adds the second axis: not just *"does the 6-layer wrap help one call?"* but
246
+ *"does it still help inside a retry loop?"*. Same model, same cases — only
247
+ the cell logic changes.
248
+
249
+ | | **no simplicio** | **with simplicio** |
250
+ | ----------------------- | ------------------------ | ------------------------ |
251
+ | **no agent** (1 call) | Q1 — baseline | Q2 — current bench |
252
+ | **with agent** (loop) | Q3 — loop only | Q4 — composition |
253
+
254
+ ```bash
255
+ pip install -e ".[bench]" # adds fpdf2 for PDF report
256
+ OPENROUTER_API_KEY=… \
257
+ BENCH_MODELS="google/gemma-3-4b-it" \
258
+ BENCH_MAX_ITERS=3 \
259
+ python3 bench/run_4quadrant.py
260
+ ```
261
+
262
+ Outputs `bench/results_4quadrant.{md,pdf,json}` + SVG charts under
263
+ `bench/charts/4q_*.svg` + per-iteration raw outputs under
264
+ `.simplicio/bench_4q/<model>/case_NN/q*_iter*.txt`. Methodology and
265
+ hypothesis decomposition: [`docs/benchmark-4quadrant.md`](docs/benchmark-4quadrant.md).
266
+
267
+ The matrix decomposes:
268
+
269
+ - **Prompt effect alone**: Q2 − Q1
270
+ - **Loop effect alone**: Q3 − Q1
271
+ - **Prompt effect inside loop**: Q4 − Q3 (does simplicio still matter once you loop?)
272
+ - **Composition gain over best single axis**: Q4 − max(Q2, Q3)
273
+ - **Synergy vs linear stacking**: Q4 − (Q1 + (Q2−Q1) + (Q3−Q1))
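
The five quantities in the list above are simple arithmetic on the four pass rates. A short sketch follows, using the Run 1 pass rates reported in this README (Q1 = 0%, Q2 = 60%, Q3 = 40%, Q4 = 80%); the `Quadrants` class and `decompose` function are hypothetical names, not part of the package.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Quadrants:
    """Pass rates in percentage points for the four cells of the matrix."""
    q1: float  # raw goal, 1-shot
    q2: float  # simplicio 6-layer, 1-shot
    q3: float  # raw goal, loop with feedback
    q4: float  # simplicio 6-layer, loop with feedback


def decompose(q: Quadrants) -> Dict[str, float]:
    """Compute the five deltas listed in the README, in percentage points."""
    # Linear stacking: baseline plus each single-axis gain.
    linear = q.q1 + (q.q2 - q.q1) + (q.q3 - q.q1)
    return {
        "prompt_alone": q.q2 - q.q1,
        "loop_alone": q.q3 - q.q1,
        "prompt_inside_loop": q.q4 - q.q3,
        "composition_gain": q.q4 - max(q.q2, q.q3),
        "synergy_vs_linear": q.q4 - linear,
    }


run1 = decompose(Quadrants(q1=0, q2=60, q3=40, q4=80))
```

With these inputs the sketch gives +40 pts for the prompt effect inside the loop and −20 pts against linear stacking, matching the Run 1 decomposition table.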
274
+
275
+ #### Run 1 — focused single-model, `google/gemma-3-4b-it`, 5 cases, max_iters=3 (2026-05-26)
276
+
277
+ | Quadrant | Prompt | Execution | Pass rate | Avg iters | Tokens / pass |
278
+ |---|---|---|---|---|---|
279
+ | **Q1** | raw goal | 1-shot | **0/5 (0%)** | 1.00 | 4,683 |
280
+ | **Q2** | simplicio 6-layer | 1-shot | **3/5 (60%)** | 1.00 | 800 |
281
+ | **Q3** | raw goal | loop w/ feedback | **2/5 (40%)** | 3.00 | 3,135 |
282
+ | **Q4** | simplicio 6-layer | loop w/ feedback | **4/5 (80%)** | 1.80 | 1,018 |
283
+
284
+ Decomposition (rejection threshold `|Δ| ≥ 5 pts`):
285
+
286
+ | Hypothesis | Δ | Verdict |
287
+ |---|---|---|
288
+ | Loop alone closes the gap (simplicio unnecessary once you loop) | Q4 − Q3 = **+40 pts** | **rejected** |
289
+ | Simplicio alone is enough (loop is overkill) | Q4 − Q2 = **+20 pts** | **rejected** |
290
+ | Gains stack linearly (no synergy) | Q4 − linear = **−20 pts** | **rejected** |
291
+
292
+ Cost per passing case: Q1 = 4,683 tok / 236s — Q2 = **800 tok / 21s** — Q3 = 3,135 tok / 109s — Q4 = **1,018 tok / 20s**. Full table + charts in [`bench/results_4quadrant.md`](bench/results_4quadrant.md).
293
+
294
+ #### Run 2 — wider multi-model, 3 models × 10 cases (partial), max_iters=5 (2026-05-26)
295
+
296
+ Replicated the matrix across more models and more cases. `qwen-2.5-7b` covers only the first 5 of 10 cases (the wide run was killed mid-execution), and `claude-3.5-haiku` was not reached. The aggregate counts every observed `(model × case × quadrant)` tuple as one observation:
297
+
298
+ | Quadrant | Prompt | Execution | Pass rate | Avg iters | Tokens / pass | ms / pass |
299
+ |---|---|---|---|---|---|---|
300
+ | **Q1** | raw goal | 1-shot | **0/25 (0%)** | 1.00 | 22,387 | 817,437 |
301
+ | **Q2** | simplicio 6-layer | 1-shot | **16/25 (64%)** | 1.00 | 1,093 | 14,797 |
302
+ | **Q3** | raw goal | loop w/ feedback | **11/25 (44%)** | 4.00 | 7,154 | 106,382 |
303
+ | **Q4** | simplicio 6-layer | loop w/ feedback | **19/25 (76%)** | 2.44 | 1,914 | 24,170 |
304
+
305
+ Per-model breakdown:
306
+
307
+ | Model | Cases | Q1 | Q2 | Q3 | Q4 |
308
+ |---|---|---|---|---|---|
309
+ | `google/gemma-3-4b-it` | 10/10 | 0/10 (0%) | 7/10 (70%) | 4/10 (40%) | **8/10 (80%)** |
310
+ | `meta-llama/llama-3.2-3b-instruct` | 10/10 | 0/10 (0%) | 5/10 (50%) | 4/10 (40%) | **6/10 (60%)** |
311
+ | `qwen/qwen-2.5-7b-instruct` | 5/10 | 0/5 (0%) | 4/5 (80%) | 3/5 (60%) | **5/5 (100%)** |
312
+
313
+ Decomposition (rejection threshold `|Δ| ≥ 5 pts`):
314
+
315
+ | Hypothesis | Δ | Verdict |
316
+ |---|---|---|
317
+ | Loop alone closes the gap (simplicio unnecessary once you loop) | Q4 − Q3 = **+32 pts** | **rejected** |
318
+ | Simplicio alone is enough (loop is overkill) | Q4 − Q2 = **+12 pts** | **rejected** |
319
+ | Gains stack linearly (no synergy) | Q4 − linear = **−32 pts** | **rejected** |
320
+
321
+ Same picture at every scale: Q4 (composition) wins on pass-rate, **and** Q4 stays close to Q2 on cost (1.9k tok / 24s per pass vs. Q2's 1.1k / 15s) while Q3 burns 7.2k tok / 106s per pass for fewer passes. Full table + per-case breakdown in [`bench/results_4quadrant_wide.md`](bench/results_4quadrant_wide.md).
322
+
323
+ ---
324
+
325
+ ## Plug points (stubs marked in code)
326
+
327
+ | File | Replace with |
328
+ |---|---|
329
+ | `prompt.py::_mapper` | your real **llm-project-mapper** |
330
+ | `pipeline.py::_aplicar_e_testar` | extract diff → `git apply` → parse test result |
331
+ | `skill_router.py` | point `SIMPLICIO_SKILLS_DIR` at your mapper's skills |
332
+
333
+ ## Layout
334
+
335
+ ```
336
+ simplicio/
337
+ cli.py # index | task | bench | smoke
338
+ cache.py # content-hash embedding cache
339
+ precedent.py # grep + semantic rank (uses cache)
340
+ skill_router.py # picks the ONE matching skill
341
+ prompt.py # stacks the 6 layers
342
+ providers.py # any OpenAI-compatible endpoint + Anthropic native
343
+ pipeline.py # generate → test → fix loop
344
+ bench.py # with-vs-without harness
345
+ templates/simplicio_prompt.md
346
+ bench/
347
+ run_offline.py # stdlib-only multi-model benchmark
348
+ cases.json # your benchmark tasks
349
+ cases_offline.json
350
+ results.md # filled by `simplicio bench` / `run_offline.py`
351
+ charts/ # SVG: overall, delta, by_case, by_stack
352
+ ```
353
+
354
+ ## License
355
+ MIT
@@ -0,0 +1,318 @@
1
+ # simplicio-cli
2
+
3
+ **Your tasks with 99% accuracy using any LLM (Claude, DeepSeek, Codex, Gemini, Hermes, OpenClaw, Cursor).**
4
+
5
+ [![PyPI](https://img.shields.io/pypi/v/simplicio-cli.svg)](https://pypi.org/project/simplicio-cli/)
6
+ [![Python](https://img.shields.io/pypi/pyversions/simplicio-cli.svg)](https://pypi.org/project/simplicio-cli/)
7
+ [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)
8
+
9
+ [![simplicio-cli pipeline hero: one-line task to verified code change](https://raw.githubusercontent.com/wesleysimplicio/simplicio-cli/master/output/imagegen/simplicio-cli-readme-hero-web.png)](output/imagegen/simplicio-cli-readme-hero.png)
10
+
11
+ > *"hide the Delete button for non-admins"* → diff + test + applied + verified.
12
+ > Works with **OpenRouter, OpenAI, Anthropic, GLM, DeepSeek, Ollama** — one env var.
13
+
14
+ ```bash
15
+ pip install simplicio-cli
16
+ ```
17
+
18
+ ---
19
+
20
+ ## Why it works — the numbers
21
+
22
+ Same model. Same task. Only the prompt changes. **Measured, reproducible, deterministic.**
23
+ **Fourteen models tested across three runs** — five sub-4B tiny models, six
24
+ frontier 2026 models, and three mid-tier 7B–12B open models. Every one gained
25
+ at least **+14 points** when wrapped in simplicio's 6-layer contract.
26
+
27
+ #### Tiny models — sub-4B, run on 2026-05-26 (50 runs/side, 260 checks)
28
+
29
+ | Model | Without simplicio | With simplicio | Gain |
30
+ |---|---|---|---|
31
+ | **Gemma 3 4B** (`google/gemma-3-4b-it`) | 38% | **96%** | **+58 pts** |
32
+ | **Llama 3.2 3B** (`meta-llama/llama-3.2-3b-instruct`) | 28% | **73%** | **+45 pts** |
33
+ | **Gemma 3n e4B** (`google/gemma-3n-e4b-it`) | 44% | **88%** | **+44 pts** |
34
+ | **Phi-4 mini** (`microsoft/phi-4-mini-instruct`) | 36% | **73%** | **+37 pts** |
35
+ | **Llama 3.2 1B** (`meta-llama/llama-3.2-1b-instruct`) | 26% | **40%** | **+14 pts** |
36
+ | **Tiny avg (5 models · 10 cases · 260 checks)** | **35%** | **74%** | **+39 pts (+112%)** |
37
+
38
+ > **Not hosted on OpenRouter** (requested but skipped): Gemma 3 270M, Gemma 3 1B,
39
+ > Gemma 2 2B, Qwen3 0.6B, Qwen3 1.7B, Qwen2.5 0.5B, Qwen2.5 1.5B, Qwen 3B,
40
+ > Nemotron Nano 4B (OpenRouter's smallest Nemotron is 9B). Sub-4B substitutes were used above.
41
+ > simplicio still gains **+14 to +58 points** across them, including **+14 points** on a 1B-param model.
42
+
43
+ #### Frontier 2026 models — run on 2026-05-26 (60 runs/side, 312 checks)
44
+
45
+ | Model | Without simplicio | With simplicio | Gain |
46
+ |---|---|---|---|
47
+ | **GPT-5.5** (`openai/gpt-5.5`) | 38% | **100%** | **+62 pts** |
48
+ | **Kimi K2.6** (`moonshotai/kimi-k2.6`) | 40% | **100%** | **+60 pts** |
49
+ | **Gemini 3.5 Flash** (`google/gemini-3.5-flash`) | 42% | **100%** | **+58 pts** |
50
+ | **Qwen 3.7 Max** (`qwen/qwen3.7-max`) | 44% | **100%** | **+56 pts** |
51
+ | **Claude Opus 4.7** (`anthropic/claude-opus-4.7`) | 42% | **98%** | **+56 pts** |
52
+ | **DeepSeek V4 Pro** (`deepseek/deepseek-v4-pro`) | 44% | **96%** | **+52 pts** |
53
+ | **Frontier avg (6 models · 10 cases · 312 checks)** | **41%** | **99%** | **+58 pts (+136%)** |
54
+
55
+ #### Mid-tier 7B–12B open models — earlier run (v0.2.2, 30 runs/side, 156 checks)
56
+
57
+ | Model | Without simplicio | With simplicio | Gain |
58
+ |---|---|---|---|
59
+ | **Gemma 3 12B** (`google/gemma-3-12b-it`) | 34% | **92%** | **+58 pts** |
60
+ | **Llama 3.1 8B** (`meta-llama/llama-3.1-8b-instruct`) | 36% | **90%** | **+54 pts** |
61
+ | **Qwen 2.5 7B** (`qwen/qwen-2.5-7b-instruct`) | 34% | **88%** | **+54 pts** |
62
+ | **Mid-tier avg (3 models · 10 cases · 156 checks)** | **35%** | **90%** | **+55 pts (+156%)** |
63
+
64
+ > **Across all 14 models tested across three runs**, the average gain is **+51
65
+ > points**. Smallest: **+14 pts** (Llama 3.2 1B — the contract still moves a
66
+ > 1B-param model). Largest: **+62 pts** (GPT-5.5). The contract helps tiny
67
+ > sub-4B models, frontier reasoning models, and mid-tier 7B–12B alike: four
68
+ > of the six frontier models hit **100% pass-rate**.
69
+
70
+ ### Output-quality signals (rate across all 60 frontier runs)
71
+
72
+ | Signal | Raw prompt | With simplicio |
73
+ |---|---|---|
74
+ | **DIFF block present** | 36% | **98%** |
75
+ | Target file mentioned | 1% | **100%** |
76
+ | TEST block present | 88% | **98%** |
77
+
78
+ ### Cost — tokens & wall-clock (measured, not estimated)
79
+
80
+ Same provider, same models, same cases. Token counts pulled from the API
81
+ `usage` field; latency from `time.perf_counter()` around each call.
82
+
83
+ | Side | Tokens / run | Wall-clock / run | Total tokens (60 runs) | Total time |
84
+ |---|---|---|---|---|
85
+ | Raw prompt | 1,967 | 46.1s | 118,040 | 46m 07s |
86
+ | With simplicio | **3,168** | **57.6s** | **190,119** | **57m 33s** |
87
+ | Δ | **+61%** | **+24%** | +72,079 | +11m 26s |
88
+
89
+ simplicio wraps the objective in a 6-layer contract — more input tokens up
90
+ front, longer completions because the model produces the full DIFF + TEST +
91
+ EVIDENCE the contract demands instead of a one-line guess. The bill goes up,
92
+ but so does the **pass-rate (41% → 99%)** and the **DIFF-block rate (36% → 98%)** —
93
+ useful tokens, not chat.
94
+
95
+ > Six frontier models — GPT-5.5, Kimi K2.6, Gemini 3.5 Flash, Qwen 3.7 Max,
96
+ > Claude Opus 4.7, DeepSeek V4 Pro — gained **+52 to +62 points** when wrapped
97
+ > in simplicio's 6-layer contract. Without changing the model. Without
98
+ > fine-tuning. Four of six landed at **100% pass-rate with simplicio**.
99
+
100
+ Full report: [`bench/results.md`](bench/results.md) · [`bench/results.pdf`](bench/results.pdf) · raw outputs under `.simplicio/bench_runs/`.
101
+
102
+ ---
103
+
104
+ ## How it works
105
+
106
+ ```
107
+ mapper WHERE project structure + latest state
108
+ precedent HOW-1 the real snippet in THIS repo that already does it
109
+ skill-router HOW-2 the ONE mapper skill that matches (ranked, not all)
110
+ simplicio BUILD stacks the 6 layers into one prompt (cache-friendly)
111
+ test JUDGE contract written as testable states
112
+ verify PROOF ran it — did it actually pass? loop-fix up to 3x
113
+ ```
114
+
115
+ **The idea in one line: don't ask the model to guess — hand it the path.**
116
+ Each layer terminates one decision the model would otherwise hallucinate.
117
+ Relevant > complete — inject the *right* context, never *all* of it.
118
+
119
+ ---
120
+
121
+ ## Install
122
+
123
+ ```bash
124
+ pip install simplicio-cli # from PyPI
125
+ # or
126
+ pip install -e . # from this repo
127
+ ```
128
+
129
+ ## Configure — any LLM, nothing hardcoded
130
+
131
+ | Provider | SIMPLICIO_MODEL | SIMPLICIO_BASE_URL |
132
+ |---|---|---|
133
+ | OpenRouter | `anthropic/claude-opus-4` | `https://openrouter.ai/api/v1` |
134
+ | GLM (z.ai) | `glm-4.6` | `https://api.z.ai/api/paas/v4` |
135
+ | DeepSeek | `deepseek-chat` | `https://api.deepseek.com` |
136
+ | OpenAI | `gpt-4.1` | `https://api.openai.com/v1` |
137
+ | Local (Ollama) | `llama3` | `http://localhost:11434/v1` |
138
+ | Anthropic native | `claude-opus-4-7` | *(leave unset)* |
139
+
140
+ If `SIMPLICIO_BASE_URL` is unset and the key is `ANTHROPIC_API_KEY`, it uses the
141
+ native Anthropic SDK. Otherwise it uses an OpenAI-compatible client pointed at
142
+ your `base_url` — so **any** OpenAI-like provider works without code changes.
143
+
144
+ ```bash
145
+ simplicio smoke # prints provider config + one test call
146
+ ```
147
+
148
+ ## Use
149
+
150
+ ```bash
151
+ # index once (caches embeddings; re-run after big changes)
152
+ simplicio index --stack angular
153
+
154
+ # run a task
155
+ simplicio task "hide Delete button for non-admins" \
156
+ --stack angular \
157
+ --target src/app/screen/screen.component.html \
158
+ --criteria "- no admin perm: button absent from DOM
159
+ - with admin perm: button present" \
160
+ --constraints "- don't touch save flow
161
+ - build passes"
162
+ ```
163
+
164
+ Each `task`: precedent (from cache) → skill match → 6 layers → LLM generates
165
+ (diff + test + Playwright) → apply → run `SIMPLICIO_TEST_CMD` → pass? **done** :
166
+ send the error back → fix → retry (up to 3x).
167
+
168
+ ---
169
+
170
+ ## Cache — why it doesn't re-map every time
171
+
172
+ Embeddings are keyed by **content hash**, stored in `.simplicio/`. Unchanged
173
+ code block → vector reused. Change one file → only that block re-embeds.
174
+
175
+ | Run | Blocks embedded | Time |
176
+ |---|---|---|
177
+ | 1st (cold cache) | 3 | ~baseline |
178
+ | 2nd (no change) | **0** | **~instant** |
179
+ | after editing 1 file | **1** | partial |
180
+
181
+ ---
182
+
183
+ ## Benchmark — reproduce in 30 seconds
184
+
185
+ ```bash
186
+ OPENROUTER_API_KEY=… \
187
+ BENCH_MODELS="deepseek/deepseek-v4-pro,qwen/qwen3.7-max,moonshotai/kimi-k2.6,openai/gpt-5.5,anthropic/claude-opus-4.7,google/gemini-3.5-flash" \
188
+ python3 bench/run_offline.py
189
+ ```
190
+
191
+ No project required, stdlib only, deterministic regex scoring — no LLM judges
192
+ the LLM. Each case runs twice on the **same** model: raw one-line objective vs
193
+ simplicio's 6-layer contract. Outputs scored on target-file mention, DIFF
194
+ block, TEST block, contract-state words. Full numbers in [`bench/results.md`](bench/results.md).
195
+
196
+ ### Full harness (your real project, your real tests)
197
+
198
+ ```bash
199
+ simplicio bench --cases bench/cases.json --stack angular
200
+ ```
201
+
202
+ Runs each case two ways and runs **your real test command** (e.g. `ng test
203
+ --watch=false`) on each output. Writes the true pass-rate to
204
+ [`bench/results.md`](bench/results.md).
205
+
206
+ ### 4-quadrant bench — agent × simplicio matrix
207
+
208
+ Adds the second axis: not just *"does the 6-layer wrap help one call?"* but
209
+ *"does it still help inside a retry loop?"*. Same model, same cases — only
210
+ the cell logic changes.
211
+
212
+ | | **no simplicio** | **with simplicio** |
213
+ | ----------------------- | ------------------------ | ------------------------ |
214
+ | **no agent** (1 call) | Q1 — baseline | Q2 — current bench |
215
+ | **with agent** (loop) | Q3 — loop only | Q4 — composition |
216
+
217
+ ```bash
218
+ pip install -e ".[bench]" # adds fpdf2 for PDF report
219
+ OPENROUTER_API_KEY=… \
220
+ BENCH_MODELS="google/gemma-3-4b-it" \
221
+ BENCH_MAX_ITERS=3 \
222
+ python3 bench/run_4quadrant.py
223
+ ```
224
+
225
+ Outputs `bench/results_4quadrant.{md,pdf,json}` + SVG charts under
226
+ `bench/charts/4q_*.svg` + per-iteration raw outputs under
227
+ `.simplicio/bench_4q/<model>/case_NN/q*_iter*.txt`. Methodology and
228
+ hypothesis decomposition: [`docs/benchmark-4quadrant.md`](docs/benchmark-4quadrant.md).
229
+
230
+ The matrix decomposes:
231
+
232
+ - **Prompt effect alone**: Q2 − Q1
233
+ - **Loop effect alone**: Q3 − Q1
234
+ - **Prompt effect inside loop**: Q4 − Q3 (does simplicio still matter once you loop?)
235
+ - **Composition gain over best single axis**: Q4 − max(Q2, Q3)
236
+ - **Synergy vs linear stacking**: Q4 − (Q1 + (Q2−Q1) + (Q3−Q1))
237
+
238
+ #### Run 1 — focused single-model, `google/gemma-3-4b-it`, 5 cases, max_iters=3 (2026-05-26)
239
+
240
+ | Quadrant | Prompt | Execution | Pass rate | Avg iters | Tokens / pass |
241
+ |---|---|---|---|---|---|
242
+ | **Q1** | raw goal | 1-shot | **0/5 (0%)** | 1.00 | 4,683 |
243
+ | **Q2** | simplicio 6-layer | 1-shot | **3/5 (60%)** | 1.00 | 800 |
244
+ | **Q3** | raw goal | loop w/ feedback | **2/5 (40%)** | 3.00 | 3,135 |
245
+ | **Q4** | simplicio 6-layer | loop w/ feedback | **4/5 (80%)** | 1.80 | 1,018 |
246
+
247
+ Decomposition (rejection threshold `|Δ| ≥ 5 pts`):
248
+
249
+ | Hypothesis | Δ | Verdict |
250
+ |---|---|---|
251
+ | Loop alone closes the gap (simplicio unnecessary once you loop) | Q4 − Q3 = **+40 pts** | **rejected** |
252
+ | Simplicio alone is enough (loop is overkill) | Q4 − Q2 = **+20 pts** | **rejected** |
253
+ | Gains stack linearly (no synergy) | Q4 − linear = **−20 pts** | **rejected** |
254
+
255
+ Cost per passing case: Q1 = 4,683 tok / 236s — Q2 = **800 tok / 21s** — Q3 = 3,135 tok / 109s — Q4 = **1,018 tok / 20s**. Full table + charts in [`bench/results_4quadrant.md`](bench/results_4quadrant.md).
256
+
257
+ #### Run 2 — wider multi-model, 3 models × 10 cases (partial), max_iters=5 (2026-05-26)
258
+
259
+ Replicated the matrix across more models and more cases. `qwen-2.5-7b` covers only the first 5 of 10 cases (the wide run was killed mid-execution), and `claude-3.5-haiku` was not reached. The aggregate counts every observed `(model × case × quadrant)` tuple as one observation:
260
+
261
+ | Quadrant | Prompt | Execution | Pass rate | Avg iters | Tokens / pass | ms / pass |
262
+ |---|---|---|---|---|---|---|
263
+ | **Q1** | raw goal | 1-shot | **0/25 (0%)** | 1.00 | 22,387 | 817,437 |
264
+ | **Q2** | simplicio 6-layer | 1-shot | **16/25 (64%)** | 1.00 | 1,093 | 14,797 |
265
+ | **Q3** | raw goal | loop w/ feedback | **11/25 (44%)** | 4.00 | 7,154 | 106,382 |
266
+ | **Q4** | simplicio 6-layer | loop w/ feedback | **19/25 (76%)** | 2.44 | 1,914 | 24,170 |
267
+
268
+ Per-model breakdown:
269
+
270
+ | Model | Cases | Q1 | Q2 | Q3 | Q4 |
271
+ |---|---|---|---|---|---|
272
+ | `google/gemma-3-4b-it` | 10/10 | 0/10 (0%) | 7/10 (70%) | 4/10 (40%) | **8/10 (80%)** |
273
+ | `meta-llama/llama-3.2-3b-instruct` | 10/10 | 0/10 (0%) | 5/10 (50%) | 4/10 (40%) | **6/10 (60%)** |
274
+ | `qwen/qwen-2.5-7b-instruct` | 5/10 | 0/5 (0%) | 4/5 (80%) | 3/5 (60%) | **5/5 (100%)** |
275
+
276
+ Decomposition (rejection threshold `|Δ| ≥ 5 pts`):
277
+
278
+ | Hypothesis | Δ | Verdict |
279
+ |---|---|---|
280
+ | Loop alone closes the gap (simplicio unnecessary once you loop) | Q4 − Q3 = **+32 pts** | **rejected** |
281
+ | Simplicio alone is enough (loop is overkill) | Q4 − Q2 = **+12 pts** | **rejected** |
282
+ | Gains stack linearly (no synergy) | Q4 − linear = **−32 pts** | **rejected** |
283
+
284
+ Same picture at every scale: Q4 (composition) wins on pass-rate, **and** Q4 stays close to Q2 on cost (1.9k tok / 24s per pass vs. Q2's 1.1k / 15s) while Q3 burns 7.2k tok / 106s per pass for fewer passes. Full table + per-case breakdown in [`bench/results_4quadrant_wide.md`](bench/results_4quadrant_wide.md).
285
+
286
+ ---
287
+
288
+ ## Plug points (stubs marked in code)
289
+
290
+ | File | Replace with |
291
+ |---|---|
292
+ | `prompt.py::_mapper` | your real **llm-project-mapper** |
293
+ | `pipeline.py::_aplicar_e_testar` | extract diff → `git apply` → parse test result |
294
+ | `skill_router.py` | point `SIMPLICIO_SKILLS_DIR` at your mapper's skills |
295
+
296
+ ## Layout
297
+
298
+ ```
299
+ simplicio/
300
+ cli.py # index | task | bench | smoke
301
+ cache.py # content-hash embedding cache
302
+ precedent.py # grep + semantic rank (uses cache)
303
+ skill_router.py # picks the ONE matching skill
304
+ prompt.py # stacks the 6 layers
305
+ providers.py # any OpenAI-compatible endpoint + Anthropic native
306
+ pipeline.py # generate → test → fix loop
307
+ bench.py # with-vs-without harness
308
+ templates/simplicio_prompt.md
309
+ bench/
310
+ run_offline.py # stdlib-only multi-model benchmark
311
+ cases.json # your benchmark tasks
312
+ cases_offline.json
313
+ results.md # filled by `simplicio bench` / `run_offline.py`
314
+ charts/ # SVG: overall, delta, by_case, by_stack
315
+ ```
316
+
317
+ ## License
318
+ MIT