simplicio-cli 0.2.3__tar.gz → 0.2.12__tar.gz

Files changed (29)
  1. simplicio_cli-0.2.12/PKG-INFO +431 -0
  2. simplicio_cli-0.2.12/README.md +394 -0
  3. {simplicio_cli-0.2.3 → simplicio_cli-0.2.12}/pyproject.toml +9 -2
  4. simplicio_cli-0.2.12/simplicio/cli.py +120 -0
  5. simplicio_cli-0.2.12/simplicio/detect.py +139 -0
  6. simplicio_cli-0.2.12/simplicio/init.py +168 -0
  7. simplicio_cli-0.2.12/simplicio/templates/SKILL.md +169 -0
  8. simplicio_cli-0.2.12/simplicio/templates/userpromptsubmit-hook.sh +22 -0
  9. simplicio_cli-0.2.12/simplicio_cli.egg-info/PKG-INFO +431 -0
  10. {simplicio_cli-0.2.3 → simplicio_cli-0.2.12}/simplicio_cli.egg-info/SOURCES.txt +4 -0
  11. {simplicio_cli-0.2.3 → simplicio_cli-0.2.12}/simplicio_cli.egg-info/requires.txt +3 -0
  12. simplicio_cli-0.2.3/PKG-INFO +0 -231
  13. simplicio_cli-0.2.3/README.md +0 -196
  14. simplicio_cli-0.2.3/simplicio/cli.py +0 -43
  15. simplicio_cli-0.2.3/simplicio_cli.egg-info/PKG-INFO +0 -231
  16. {simplicio_cli-0.2.3 → simplicio_cli-0.2.12}/LICENSE +0 -0
  17. {simplicio_cli-0.2.3 → simplicio_cli-0.2.12}/setup.cfg +0 -0
  18. {simplicio_cli-0.2.3 → simplicio_cli-0.2.12}/simplicio/__init__.py +0 -0
  19. {simplicio_cli-0.2.3 → simplicio_cli-0.2.12}/simplicio/bench.py +0 -0
  20. {simplicio_cli-0.2.3 → simplicio_cli-0.2.12}/simplicio/cache.py +0 -0
  21. {simplicio_cli-0.2.3 → simplicio_cli-0.2.12}/simplicio/pipeline.py +0 -0
  22. {simplicio_cli-0.2.3 → simplicio_cli-0.2.12}/simplicio/precedent.py +0 -0
  23. {simplicio_cli-0.2.3 → simplicio_cli-0.2.12}/simplicio/prompt.py +0 -0
  24. {simplicio_cli-0.2.3 → simplicio_cli-0.2.12}/simplicio/providers.py +0 -0
  25. {simplicio_cli-0.2.3 → simplicio_cli-0.2.12}/simplicio/skill_router.py +0 -0
  26. {simplicio_cli-0.2.3 → simplicio_cli-0.2.12}/simplicio/templates/simplicio_prompt.md +0 -0
  27. {simplicio_cli-0.2.3 → simplicio_cli-0.2.12}/simplicio_cli.egg-info/dependency_links.txt +0 -0
  28. {simplicio_cli-0.2.3 → simplicio_cli-0.2.12}/simplicio_cli.egg-info/entry_points.txt +0 -0
  29. {simplicio_cli-0.2.3 → simplicio_cli-0.2.12}/simplicio_cli.egg-info/top_level.txt +0 -0
@@ -0,0 +1,431 @@
1
+ Metadata-Version: 2.4
2
+ Name: simplicio-cli
3
+ Version: 0.2.12
4
+ Summary: Portable task-to-code pipeline that works with any LLM. Turn a one-line task into a verified code change — diff + test + verify loop. +55 pts on a 156-check benchmark, 21% faster, ~same tokens.
5
+ Author-email: Wesley Simplicio <wesleybob4@gmail.com>
6
+ License: MIT
7
+ Project-URL: Homepage, https://github.com/wesleysimplicio/simplicio-cli
8
+ Project-URL: Repository, https://github.com/wesleysimplicio/simplicio-cli
9
+ Project-URL: Issues, https://github.com/wesleysimplicio/simplicio-cli/issues
10
+ Project-URL: Changelog, https://github.com/wesleysimplicio/simplicio-cli/releases
11
+ Keywords: llm,ai,agent,code-generation,prompt-engineering,openrouter,openai,anthropic,claude,developer-tools,cli,rag,embeddings,verify-loop,task-automation
12
+ Classifier: Development Status :: 4 - Beta
13
+ Classifier: Environment :: Console
14
+ Classifier: Intended Audience :: Developers
15
+ Classifier: License :: OSI Approved :: MIT License
16
+ Classifier: Operating System :: OS Independent
17
+ Classifier: Programming Language :: Python :: 3
18
+ Classifier: Programming Language :: Python :: 3 :: Only
19
+ Classifier: Programming Language :: Python :: 3.9
20
+ Classifier: Programming Language :: Python :: 3.10
21
+ Classifier: Programming Language :: Python :: 3.11
22
+ Classifier: Programming Language :: Python :: 3.12
23
+ Classifier: Topic :: Software Development
24
+ Classifier: Topic :: Software Development :: Code Generators
25
+ Classifier: Topic :: Software Development :: Libraries :: Python Modules
26
+ Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
27
+ Requires-Python: >=3.9
28
+ Description-Content-Type: text/markdown
29
+ License-File: LICENSE
30
+ Requires-Dist: sentence-transformers>=2.2
31
+ Requires-Dist: numpy>=1.23
32
+ Requires-Dist: anthropic>=0.30
33
+ Requires-Dist: openai>=1.30
34
+ Provides-Extra: bench
35
+ Requires-Dist: fpdf2>=2.7; extra == "bench"
36
+ Dynamic: license-file
37
+
38
+ # simplicio-cli
39
+
40
+ **Your tasks with 99% accuracy using any LLM (Claude, DeepSeek, Codex, Gemini, Hermes, OpenClaw, Cursor).**
41
+
42
+ [![PyPI](https://img.shields.io/pypi/v/simplicio-cli.svg)](https://pypi.org/project/simplicio-cli/)
43
+ [![Python](https://img.shields.io/pypi/pyversions/simplicio-cli.svg)](https://pypi.org/project/simplicio-cli/)
44
+ [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)
45
+
46
+ [![simplicio-cli pipeline hero: one-line task to verified code change](https://raw.githubusercontent.com/wesleysimplicio/simplicio-cli/master/output/imagegen/simplicio-cli-readme-hero-web.png)](output/imagegen/simplicio-cli-readme-hero.png)
47
+
48
+ > *"hide the Delete button for non-admins"* → diff + test + applied + verified.
49
+ > Works with **OpenRouter, OpenAI, Anthropic, GLM, DeepSeek, Ollama** — one env var.
50
+
51
+ ```bash
52
+ pip install simplicio-cli
53
+ ```
54
+
55
+ ---
56
+
57
+ ## Why it works — the numbers
58
+
59
+ Same model. Same task. Only the prompt changes. **Measured, reproducible, deterministic.**
60
+ **Fourteen models tested across three runs** — five sub-4B tiny models, six
61
+ frontier 2026 models, and three mid-tier 7B–12B open models. Every one gained
62
+ at least **+14 points** when wrapped in simplicio's 6-layer contract.
63
+
64
+ #### Tiny models — sub-4B, run on 2026-05-26 (50 runs/side, 260 checks)
65
+
66
+ | Model | Without simplicio | With simplicio | Gain |
67
+ |---|---|---|---|
68
+ | **Gemma 3 4B** (`google/gemma-3-4b-it`) | 38% | **96%** | **+58 pts** |
69
+ | **Llama 3.2 3B** (`meta-llama/llama-3.2-3b-instruct`) | 28% | **73%** | **+45 pts** |
70
+ | **Gemma 3n e4B** (`google/gemma-3n-e4b-it`) | 44% | **88%** | **+44 pts** |
71
+ | **Phi-4 mini** (`microsoft/phi-4-mini-instruct`) | 36% | **73%** | **+37 pts** |
72
+ | **Llama 3.2 1B** (`meta-llama/llama-3.2-1b-instruct`) | 26% | **40%** | **+14 pts** |
73
+ | **Tiny avg (5 models · 10 cases · 260 checks)** | **35%** | **74%** | **+39 pts (+112%)** |
74
+
75
+ > **Not hosted on OpenRouter** (requested but skipped): Gemma 3 270M, Gemma 3 1B,
76
+ > Gemma 2 2B, Qwen3 0.6B, Qwen3 1.7B, Qwen2.5 0.5B, Qwen2.5 1.5B, Qwen 3B,
77
+ > Nemotron Nano 4B (OR's smallest Nemotron is 9B). Sub-4B substitutes used above.
78
+ > simplicio still gains **+14 to +58 points** across these tiny models, including **+14 pts** on a 1B-param model.
79
+
80
+ #### Frontier 2026 models — run on 2026-05-26 (60 runs/side, 312 checks)
81
+
82
+ | Model | Without simplicio | With simplicio | Gain |
83
+ |---|---|---|---|
84
+ | **GPT-5.5** (`openai/gpt-5.5`) | 38% | **100%** | **+62 pts** |
85
+ | **Kimi K2.6** (`moonshotai/kimi-k2.6`) | 40% | **100%** | **+60 pts** |
86
+ | **Gemini 3.5 Flash** (`google/gemini-3.5-flash`) | 42% | **100%** | **+58 pts** |
87
+ | **Qwen 3.7 Max** (`qwen/qwen3.7-max`) | 44% | **100%** | **+56 pts** |
88
+ | **Claude Opus 4.7** (`anthropic/claude-opus-4.7`) | 42% | **98%** | **+56 pts** |
89
+ | **DeepSeek V4 Pro** (`deepseek/deepseek-v4-pro`) | 44% | **96%** | **+52 pts** |
90
+ | **Frontier avg (6 models · 10 cases · 312 checks)** | **41%** | **99%** | **+58 pts (+136%)** |
91
+
92
+ #### Mid-tier 7B–12B open models — earlier run (v0.2.2, 30 runs/side, 156 checks)
93
+
94
+ | Model | Without simplicio | With simplicio | Gain |
95
+ |---|---|---|---|
96
+ | **Gemma 3 12B** (`google/gemma-3-12b-it`) | 34% | **92%** | **+58 pts** |
97
+ | **Llama 3.1 8B** (`meta-llama/llama-3.1-8b-instruct`) | 36% | **90%** | **+54 pts** |
98
+ | **Qwen 2.5 7B** (`qwen/qwen-2.5-7b-instruct`) | 34% | **88%** | **+54 pts** |
99
+ | **Mid-tier avg (3 models · 10 cases · 156 checks)** | **35%** | **90%** | **+55 pts (+156%)** |
100
+
101
+ > **Across all 14 models tested across three runs**, the average gain is **+51
102
+ > points**. Smallest: **+14 pts** (Llama 3.2 1B — the contract still moves a
103
+ > 1B-param model). Largest: **+62 pts** (GPT-5.5). The contract helps tiny
104
+ > sub-4B models, frontier reasoning models, and mid-tier 7B–12B alike; four
104
+ > of the six frontier models hit **100% pass-rate** (the other two reached 98% and 96%).
106
+
107
+ ### Output-quality signals (rate across all 60 frontier runs)
108
+
109
+ | Signal | Raw prompt | With simplicio |
110
+ |---|---|---|
111
+ | **DIFF block present** | 36% | **98%** |
112
+ | Target file mentioned | 1% | **100%** |
113
+ | TEST block present | 88% | **98%** |
114
+
115
+ ### Cost — tokens & wall-clock (measured, not estimated)
116
+
117
+ Same provider, same models, same cases. Token counts pulled from the API
118
+ `usage` field; latency from `time.perf_counter()` around each call.
119
+
120
+ | Side | Tokens / run | Wall-clock / run | Total tokens (60 runs) | Total time |
121
+ |---|---|---|---|---|
122
+ | Raw prompt | 1,967 | 46.1s | 118,040 | 46m 07s |
123
+ | With simplicio | **3,168** | **57.6s** | **190,119** | **57m 33s** |
124
+ | Δ | **+61%** | **+24%** | +72,079 | +11m 26s |
125
+
126
+ simplicio wraps the objective in a 6-layer contract — more input tokens up
127
+ front, longer completions because the model produces the full DIFF + TEST +
128
+ EVIDENCE the contract demands instead of a one-line guess. The bill goes up,
129
+ but so does the **pass-rate (41% → 99%)** and the **DIFF-block rate (36% → 98%)** —
130
+ useful tokens, not chat.
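
The Δ row can be re-derived from the totals in the cost table. The following is a minimal sketch that uses only the figures above; the helper names `parse_duration` and `format_duration` are illustrative and not part of simplicio.

```python
def parse_duration(text: str) -> int:
    """Convert a 'MMm SSs' string such as '46m 07s' into seconds."""
    minutes, seconds = text.split()
    return int(minutes.rstrip("m")) * 60 + int(seconds.rstrip("s"))


def format_duration(total_seconds: int) -> str:
    """Format seconds back into the table's 'MMm SSs' style."""
    return f"{total_seconds // 60}m {total_seconds % 60:02d}s"


# Totals copied from the cost table (60 frontier runs per side).
raw_tokens, wrapped_tokens = 118_040, 190_119
raw_time = parse_duration("46m 07s")
wrapped_time = parse_duration("57m 33s")

token_delta = wrapped_tokens - raw_tokens
time_delta = wrapped_time - raw_time

# Relative overhead of the wrapped prompt versus the raw prompt.
token_pct = 100 * token_delta / raw_tokens
time_pct = 100 * time_delta / raw_time

print(f"tokens: +{token_delta:,} ({token_pct:.1f}%)")
print(f"time:   +{format_duration(time_delta)} ({time_pct:.1f}%)")
```

The relative overheads come out at roughly 61.1% for tokens and 24.8% for wall-clock time, in line with the table's +61% and +24%.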
131
+
132
+ > Six frontier models — GPT-5.5, Kimi K2.6, Gemini 3.5 Flash, Qwen 3.7 Max,
133
+ > Claude Opus 4.7, DeepSeek V4 Pro — gained **+52 to +62 points** when wrapped
134
+ > in simplicio's 6-layer contract. Without changing the model. Without
135
+ > fine-tuning. Four of six landed at **100% pass-rate with simplicio**.
136
+
137
+ Full report: [`bench/results.md`](bench/results.md) · [`bench/results.pdf`](bench/results.pdf) · raw outputs under `.simplicio/bench_runs/`.
138
+
139
+ ---
140
+
141
+ ## How it works
142
+
143
+ ```
144
+ mapper WHERE project structure + latest state
145
+ precedent HOW-1 the real snippet in THIS repo that already does it
146
+ skill-router HOW-2 the ONE mapper skill that matches (ranked, not all)
147
+ simplicio BUILD stacks the 6 layers into one prompt (cache-friendly)
148
+ test JUDGE contract written as testable states
149
+ verify PROOF ran it — did it actually pass? loop-fix up to 3x
150
+ ```
151
+
152
+ **The idea in one line: don't ask the model to guess — hand it the path.**
153
+ Each layer terminates one decision the model would otherwise hallucinate.
154
+ Relevant > complete — inject the *right* context, never *all* of it.
155
+
156
+ ---
157
+
158
+ ## Install
159
+
160
+ ```bash
161
+ pip install simplicio-cli # from PyPI
162
+ # or
163
+ pip install -e . # from this repo
164
+ ```
165
+
166
+ ### Auto-activation in Claude Code (often zero-step)
167
+
168
+ `pip install` puts `simplicio` on your PATH. To make Claude Code
169
+ **automatically** route code-edit tasks through simplicio, a skill + hook
170
+ need to land in `~/.claude/`.
171
+
172
+ **Zero-step path (recommended).** The first time you run *any* `simplicio`
173
+ command after install, if Claude Code is present (`~/.claude/` exists) and
174
+ the hook is missing, simplicio installs both for you and prints one stderr
175
+ line. PEP 517 wheels can't execute code on `pip install`, so this is the
176
+ closest equivalent that works on every machine.
177
+
178
+ ```bash
179
+ pip install simplicio-cli
180
+ simplicio smoke # ← first call also installs skill + hook (idempotent)
181
+ # stderr: "simplicio: auto-activation installed in Claude Code …"
182
+ ```
183
+
184
+ Opt out before the first call:
185
+
186
+ ```bash
187
+ export SIMPLICIO_SKIP_AUTO_INIT=1
188
+ ```
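
The zero-step decision described above can be sketched as a small predicate. This is an assumption-laden illustration, not the code in `simplicio/init.py`: the function name `should_auto_init` is hypothetical, and treating only the value `1` as an opt-out is a guess based on the `export` line shown here.

```python
import os
import tempfile
from pathlib import Path
from typing import Mapping, Optional

# Hook location taken from the README; everything else is a hypothetical sketch.
HOOK_RELATIVE_PATH = Path("hooks") / "simplicio-userpromptsubmit.sh"


def should_auto_init(claude_home: Path, env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True when the skill + hook should be installed on this call."""
    env = os.environ if env is None else env
    if env.get("SIMPLICIO_SKIP_AUTO_INIT") == "1":
        return False  # user opted out (assumed: only "1" counts)
    if not claude_home.is_dir():
        return False  # Claude Code is not present on this machine
    return not (claude_home / HOOK_RELATIVE_PATH).exists()


# Demo against a throwaway directory, never the real ~/.claude/.
with tempfile.TemporaryDirectory() as tmp:
    home = Path(tmp) / ".claude"
    no_claude_dir = should_auto_init(home, {})
    home.mkdir()
    fresh_install = should_auto_init(home, {})
    opted_out = should_auto_init(home, {"SIMPLICIO_SKIP_AUTO_INIT": "1"})
    hook = home / HOOK_RELATIVE_PATH
    hook.parent.mkdir(parents=True)
    hook.write_text("#!/bin/sh\n", encoding="utf-8")
    already_installed = should_auto_init(home, {})
```

Because the check depends only on the filesystem and one environment variable, running it on every command stays cheap and repeatable.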
189
+
190
+ **Explicit path.** Same effect, no auto-magic:
191
+
192
+ ```bash
193
+ simplicio init # idempotent
194
+ simplicio init --dry-run # preview only
195
+ simplicio init --claude-home <path> # override target dir
196
+ ```
197
+
198
+ Either way, two files land in `~/.claude/`:
199
+
200
+ | File | Purpose |
201
+ |---|---|
202
+ | `~/.claude/skills/simplicio-cli/SKILL.md` | Skill the agent matches by description when your prompt looks like a code edit |
203
+ | `~/.claude/hooks/simplicio-userpromptsubmit.sh` + entry in `~/.claude/settings.json` | UserPromptSubmit hook that runs `simplicio detect` on every prompt and injects a hint when the heuristic catches a code-edit task the skill could miss |
204
+
205
+ A backup of your previous `settings.json` is written to `settings.json.bak`
206
+ before any merge.
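
A backup-then-merge step of this kind might look like the sketch below. It assumes the `hooks` → `UserPromptSubmit` → `hooks` layout used by Claude Code's `settings.json`; the function name `merge_hook` is hypothetical, and the real `simplicio init` may differ in detail.

```python
import json
import shutil
import tempfile
from pathlib import Path


def merge_hook(settings_path: Path, command: str) -> bool:
    """Register a UserPromptSubmit command hook; return True if the file changed."""
    settings = {}
    if settings_path.exists():
        settings = json.loads(settings_path.read_text(encoding="utf-8"))

    groups = settings.setdefault("hooks", {}).setdefault("UserPromptSubmit", [])
    for group in groups:
        for hook in group.get("hooks", []):
            if hook.get("command") == command:
                return False  # already registered: nothing is written

    if settings_path.exists():
        # Keep a copy of the previous file before touching it.
        backup_path = settings_path.with_name(settings_path.name + ".bak")
        shutil.copy2(settings_path, backup_path)

    groups.append({"hooks": [{"type": "command", "command": command}]})
    settings_path.write_text(json.dumps(settings, indent=2) + "\n", encoding="utf-8")
    return True


# Demo against a throwaway directory, never the real ~/.claude/.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "settings.json"
    path.write_text('{"theme": "dark"}\n', encoding="utf-8")
    hook_cmd = "~/.claude/hooks/simplicio-userpromptsubmit.sh"
    first = merge_hook(path, hook_cmd)
    second = merge_hook(path, hook_cmd)
    backup = json.loads((Path(tmp) / "settings.json.bak").read_text(encoding="utf-8"))
    merged = json.loads(path.read_text(encoding="utf-8"))
```

The second call returns `False` without writing, which is one way to get the "won't double-write" behaviour the README describes.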
207
+
208
+ ### How it works at runtime
209
+
210
+ After install, every prompt you type in Claude Code flows through two layers:
211
+
212
+ 1. **Skill layer (semantic).** Claude reads the SKILL.md description. When
213
+ your prompt looks like a programming task ("add X to Y.tsx", "fix the auth
214
+ bug in middleware.py"), Claude considers using `simplicio task` instead of
215
+ writing code directly.
216
+ 2. **Hook layer (deterministic).** Every prompt fires `simplicio detect` via
217
+ the UserPromptSubmit hook. The classifier scores the prompt (verbs + file
218
+ extensions + code nouns − read-only cues). Score ≥ 3 → it emits a
219
+ `[SIMPLICIO_PROMPT_HINT]` block on stderr. Claude sees the hint alongside
220
+ your prompt — a hard nudge toward `simplicio task <prompt> <repo>`.
221
+
222
+ The layers are complementary. Skill = "Claude *might* pick simplicio". Hook
223
+ = "Claude *sees* the hint regardless".
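
To make the hook-layer heuristic concrete, here is a rough sketch of a score of the form "verbs + file extensions + code nouns − read-only cues" with the threshold of 3 mentioned above. The word lists and the names `score_prompt` and `looks_like_code_edit` are invented for illustration; the actual lists in `simplicio/detect.py` are not shown in this README.

```python
import re

# Hypothetical cue lists: the README names the categories, not their contents.
EDIT_VERBS = {"add", "fix", "hide", "rename", "refactor", "remove", "implement"}
CODE_NOUNS = {"button", "component", "function", "endpoint", "bug", "test", "middleware"}
READ_ONLY_CUES = {"explain", "why", "what", "summarize", "describe"}
FILE_EXTENSION = re.compile(r"\b[\w./-]+\.(?:py|ts|tsx|js|html|css|go|rs|java)\b")
THRESHOLD = 3  # from the README: score >= 3 emits the hint


def score_prompt(prompt: str) -> int:
    """Add points for edit cues and subtract points for read-only cues."""
    words = set(re.findall(r"[a-z]+", prompt.lower()))
    score = len(words & EDIT_VERBS)
    score += len(words & CODE_NOUNS)
    score += len(FILE_EXTENSION.findall(prompt))
    score -= len(words & READ_ONLY_CUES)
    return score


def looks_like_code_edit(prompt: str) -> bool:
    return score_prompt(prompt) >= THRESHOLD


edit_score = score_prompt("fix the auth bug in middleware.py")
question_score = score_prompt("explain why this works")
```

With these made-up lists, the README's own example "fix the auth bug in middleware.py" clears the threshold, while a read-only question scores below zero.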
224
+
225
+ ### Why UserPromptSubmit and not PreToolUse
226
+
227
+ UserPromptSubmit fires **once, before Claude decides which tool to call** —
228
+ exactly when we want to steer. PreToolUse fires *after* the decision is made,
229
+ and again for every tool call in the turn, with no access to the original
230
+ user prompt. UserPromptSubmit is the right pre-hook for routing decisions.
231
+
232
+ ### Disable / re-enable
233
+
234
+ | Goal | How |
235
+ |---|---|
236
+ | Block the auto-bootstrap | `export SIMPLICIO_SKIP_AUTO_INIT=1` before the first `simplicio` call |
237
+ | Disable hook permanently | Delete `~/.claude/hooks/simplicio-userpromptsubmit.sh` and its entry in `~/.claude/settings.json` |
238
+ | Re-install / repair | `simplicio init` (idempotent — won't double-write) |
239
+ | Preview without writing | `simplicio init --dry-run` |
240
+ | Skill-only (no hook) | Copy `.skills/simplicio-cli/SKILL.md` to `~/.claude/skills/simplicio-cli/SKILL.md` manually, skip `simplicio init` |
241
+
242
+ ## Configure — any LLM, nothing hardcoded
243
+
244
+ | Provider | SIMPLICIO_MODEL | SIMPLICIO_BASE_URL |
245
+ |---|---|---|
246
+ | OpenRouter | `anthropic/claude-opus-4` | `https://openrouter.ai/api/v1` |
247
+ | GLM (z.ai) | `glm-4.6` | `https://api.z.ai/api/paas/v4` |
248
+ | DeepSeek | `deepseek-chat` | `https://api.deepseek.com` |
249
+ | OpenAI | `gpt-4.1` | `https://api.openai.com/v1` |
250
+ | Local (Ollama) | `llama3` | `http://localhost:11434/v1` |
251
+ | Anthropic native | `claude-opus-4-7` | *(leave unset)* |
252
+
253
+ If `SIMPLICIO_BASE_URL` is unset and the key is `ANTHROPIC_API_KEY`, it uses the
254
+ native Anthropic SDK. Otherwise it uses an OpenAI-compatible client pointed at
255
+ your `base_url` — so **any** OpenAI-like provider works without code changes.
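
The selection rule in the paragraph above can be written as a pure function of the environment. This is only a sketch: `pick_backend` and the two return labels are hypothetical names, and `providers.py` may handle additional cases.

```python
from typing import Mapping


def pick_backend(env: Mapping[str, str]) -> str:
    """Return 'anthropic-native' or 'openai-compatible' for the given environment."""
    base_url = env.get("SIMPLICIO_BASE_URL", "").strip()
    if not base_url and env.get("ANTHROPIC_API_KEY"):
        # No base URL and an Anthropic key: use the native Anthropic SDK.
        return "anthropic-native"
    # Anything else goes through an OpenAI-compatible client at base_url.
    return "openai-compatible"


native = pick_backend({"ANTHROPIC_API_KEY": "placeholder-key"})
local = pick_backend({"SIMPLICIO_BASE_URL": "http://localhost:11434/v1"})
routed = pick_backend({
    "ANTHROPIC_API_KEY": "placeholder-key",
    "SIMPLICIO_BASE_URL": "https://openrouter.ai/api/v1",
})
```

Note that an explicit `SIMPLICIO_BASE_URL` wins even when an Anthropic key is present, which matches the "otherwise" branch of the rule.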
256
+
257
+ ```bash
258
+ simplicio smoke # prints provider config + one test call
259
+ ```
260
+
261
+ ## Use
262
+
263
+ ```bash
264
+ # index once (caches embeddings; re-run after big changes)
265
+ simplicio index --stack angular
266
+
267
+ # run a task
268
+ simplicio task "hide Delete button for non-admins" \
269
+ --stack angular \
270
+ --target src/app/screen/screen.component.html \
271
+ --criteria "- no admin perm: button absent from DOM
272
+ - with admin perm: button present" \
273
+ --constraints "- don't touch save flow
274
+ - build passes"
275
+ ```
276
+
277
+ Each `task`: precedent (from cache) → skill match → 6 layers → LLM generates
278
+ (diff + test + Playwright) → apply → run `SIMPLICIO_TEST_CMD` → pass? **done** :
279
+ send the error back → fix → retry (up to 3x).
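
The generate, test, fix cycle can be sketched as a bounded loop. The names below (`verify_loop`, `run_test_command`, the fake callables) are illustrative rather than the internals of `pipeline.py`, and reading "up to 3x" as a cap of three attempts in total is an assumption.

```python
import shlex
import subprocess
from typing import Callable, Optional, Tuple


def run_test_command(command: str) -> Tuple[bool, str]:
    """Run a test command string; return (passed, combined output)."""
    result = subprocess.run(shlex.split(command), capture_output=True, text=True)
    return result.returncode == 0, result.stdout + result.stderr


def verify_loop(
    generate_and_apply: Callable[[Optional[str]], None],
    run_tests: Callable[[], Tuple[bool, str]],
    max_attempts: int = 3,  # assumed reading of "up to 3x"
) -> Tuple[bool, int]:
    """Generate, test, and feed failures back until tests pass or attempts run out."""
    feedback: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        generate_and_apply(feedback)  # first call has no error to react to
        passed, output = run_tests()
        if passed:
            return True, attempt
        feedback = output  # the error goes back into the next generation
    return False, max_attempts


# Demo with stand-ins: the fake "model" only succeeds after seeing an error.
state = {"fixed": False}


def fake_generate(feedback: Optional[str]) -> None:
    state["fixed"] = feedback is not None


def fake_tests() -> Tuple[bool, str]:
    if state["fixed"]:
        return True, ""
    return False, "AssertionError: button still present"


result = verify_loop(fake_generate, fake_tests)
```

In real use, `run_tests` could be something like `lambda: run_test_command(os.environ["SIMPLICIO_TEST_CMD"])`, so the loop itself never needs to know which test runner the project uses.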
280
+
281
+ ---
282
+
283
+ ## Cache — why it doesn't re-map every time
284
+
285
+ Embeddings are keyed by **content hash**, stored in `.simplicio/`. Unchanged
286
+ code block → vector reused. Change one file → only that block re-embeds.
287
+
288
+ | Run | Blocks embedded | Time |
289
+ |---|---|---|
290
+ | 1st (cold cache) | 3 | ~baseline |
291
+ | 2nd (no change) | **0** | **~instant** |
292
+ | after editing 1 file | **1** | partial |
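
A content-hash cache that behaves like the table above can be sketched in a few lines. This version is in-memory only and takes the embedding function as a parameter; the class name `EmbeddingCache` is hypothetical, and the on-disk format under `.simplicio/` is not covered here.

```python
import hashlib
from typing import Callable, Dict, List, Sequence

Vector = List[float]


class EmbeddingCache:
    """Reuse vectors for blocks whose content hash has been seen before."""

    def __init__(self, embed: Callable[[str], Vector]) -> None:
        self._embed = embed
        self._vectors: Dict[str, Vector] = {}
        self.embedded_last_run = 0

    @staticmethod
    def key(block: str) -> str:
        return hashlib.sha256(block.encode("utf-8")).hexdigest()

    def index(self, blocks: Sequence[str]) -> List[Vector]:
        self.embedded_last_run = 0
        vectors = []
        for block in blocks:
            digest = self.key(block)
            if digest not in self._vectors:
                # Cache miss: only new or changed content is embedded.
                self._vectors[digest] = self._embed(block)
                self.embedded_last_run += 1
            vectors.append(self._vectors[digest])
        return vectors


# Stand-in embedder so the demo needs no model download.
cache = EmbeddingCache(lambda text: [float(len(text))])
blocks = ["def a(): pass", "def b(): pass", "def c(): pass"]

cache.index(blocks)
cold = cache.embedded_last_run
cache.index(blocks)
unchanged = cache.embedded_last_run
blocks[1] = "def b(): return 1"
cache.index(blocks)
after_edit = cache.embedded_last_run
```

The three counters reproduce the 3 / 0 / 1 pattern from the table: a cold run embeds everything, a repeat run embeds nothing, and one edited block triggers one re-embed.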
293
+
294
+ ---
295
+
296
+ ## Benchmark — reproduce in 30 seconds
297
+
298
+ ```bash
299
+ OPENROUTER_API_KEY=… \
300
+ BENCH_MODELS="deepseek/deepseek-v4-pro,qwen/qwen3.7-max,moonshotai/kimi-k2.6,openai/gpt-5.5,anthropic/claude-opus-4.7,google/gemini-3.5-flash" \
301
+ python3 bench/run_offline.py
302
+ ```
303
+
304
+ No project required, stdlib only, deterministic regex scoring — no LLM judges
305
+ the LLM. Each case runs twice on the **same** model: raw one-line objective vs
306
+ simplicio's 6-layer contract. Outputs scored on target-file mention, DIFF
307
+ block, TEST block, contract-state words. Full numbers in [`bench/results.md`](bench/results.md).
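
As a rough illustration of deterministic regex scoring, the sketch below checks three of the signals named above. The patterns and the name `score_output` are assumptions made for this example; `bench/run_offline.py` may use different expressions, and the contract-state word check is left out because its word list is not given here.

```python
import re
from typing import Dict

# Hypothetical patterns: unified-diff markers and common test-function shapes.
DIFF_BLOCK = re.compile(r"^(?:---|\+\+\+|@@) ", re.MULTILINE)
TEST_BLOCK = re.compile(r"\b(?:def test_|it\(|describe\(|expect\()")


def score_output(output: str, target_file: str) -> Dict[str, bool]:
    """Score one model output with plain string and regex checks."""
    return {
        "target_file": target_file in output,
        "diff_block": bool(DIFF_BLOCK.search(output)),
        "test_block": bool(TEST_BLOCK.search(output)),
    }


target = "src/app/screen/screen.component.html"
sample = """--- a/src/app/screen/screen.component.html
+++ b/src/app/screen/screen.component.html
@@ -1 +1 @@
-<button>Delete</button>
+<button *ngIf="isAdmin">Delete</button>

it('hides Delete for non-admins', () => { expect(button).toBeNull(); });
"""

good = score_output(sample, target)
vague = score_output("Just hide the button for regular users.", target)
```

Since every check is a fixed pattern, the same output always gets the same score, which is the property the README relies on when it says no LLM judges the LLM.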
308
+
309
+ ### Full harness (your real project, your real tests)
310
+
311
+ ```bash
312
+ simplicio bench --cases bench/cases.json --stack angular
313
+ ```
314
+
315
+ Runs each case two ways and runs **your real test command** (e.g. `ng test
316
+ --watch=false`) on each output. Writes the true pass-rate to
317
+ [`bench/results.md`](bench/results.md).
318
+
319
+ ### 4-quadrant bench — agent × simplicio matrix
320
+
321
+ Adds the second axis: not just *"does the 6-layer wrap help one call?"* but
322
+ *"does it still help inside a retry loop?"*. Same model, same cases — only
323
+ the cell logic changes.
324
+
325
+ | | **no simplicio** | **with simplicio** |
326
+ | ----------------------- | ------------------------ | ------------------------ |
327
+ | **no agent** (1 call) | Q1 — baseline | Q2 — current bench |
328
+ | **with agent** (loop) | Q3 — loop only | Q4 — composition |
329
+
330
+ ```bash
331
+ pip install -e ".[bench]" # adds fpdf2 for PDF report
332
+ OPENROUTER_API_KEY=… \
333
+ BENCH_MODELS="google/gemma-3-4b-it" \
334
+ BENCH_MAX_ITERS=3 \
335
+ python3 bench/run_4quadrant.py
336
+ ```
337
+
338
+ Outputs `bench/results_4quadrant.{md,pdf,json}` + SVG charts under
339
+ `bench/charts/4q_*.svg` + per-iteration raw outputs under
340
+ `.simplicio/bench_4q/<model>/case_NN/q*_iter*.txt`. Methodology and
341
+ hypothesis decomposition: [`docs/benchmark-4quadrant.md`](docs/benchmark-4quadrant.md).
342
+
343
+ The matrix decomposes:
344
+
345
+ - **Prompt effect alone**: Q2 − Q1
346
+ - **Loop effect alone**: Q3 − Q1
347
+ - **Prompt effect inside loop**: Q4 − Q3 (does simplicio still matter once you loop?)
348
+ - **Composition gain over best single axis**: Q4 − max(Q2, Q3)
349
+ - **Synergy vs linear stacking**: Q4 − (Q1 + (Q2−Q1) + (Q3−Q1))
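
The five differences in the list above can be computed directly from the four pass rates. The sketch below plugs in the Run 1 percentages reported further down (0, 60, 40, 80); the `Quadrants` class and its field names are illustrative only.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Quadrants:
    """Pass rates in percentage points for the four cells of the matrix."""
    q1: float  # raw goal, 1-shot
    q2: float  # simplicio, 1-shot
    q3: float  # raw goal, loop
    q4: float  # simplicio, loop

    def decompose(self) -> Dict[str, float]:
        linear = self.q1 + (self.q2 - self.q1) + (self.q3 - self.q1)
        return {
            "prompt_effect": self.q2 - self.q1,
            "loop_effect": self.q3 - self.q1,
            "prompt_effect_in_loop": self.q4 - self.q3,
            "composition_gain": self.q4 - max(self.q2, self.q3),
            "synergy": self.q4 - linear,
        }


run1 = Quadrants(q1=0, q2=60, q3=40, q4=80).decompose()
for name, delta in run1.items():
    print(f"{name}: {delta:+} pts")
```

A negative synergy value means the two effects overlap: the combined cell gains less than the sum of the separate prompt and loop effects, even though it is still the best cell overall.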
350
+
351
+ #### Run 1 — focused single-model, `google/gemma-3-4b-it`, 5 cases, max_iters=3 (2026-05-26)
352
+
353
+ | Quadrant | Prompt | Execution | Pass rate | Avg iters | Tokens / pass |
354
+ |---|---|---|---|---|---|
355
+ | **Q1** | raw goal | 1-shot | **0/5 (0%)** | 1.00 | 4,683 |
356
+ | **Q2** | simplicio 6-layer | 1-shot | **3/5 (60%)** | 1.00 | 800 |
357
+ | **Q3** | raw goal | loop w/ feedback | **2/5 (40%)** | 3.00 | 3,135 |
358
+ | **Q4** | simplicio 6-layer | loop w/ feedback | **4/5 (80%)** | 1.80 | 1,018 |
359
+
360
+ Decomposition (rejection threshold `|Δ| ≥ 5 pts`):
361
+
362
+ | Hypothesis | Δ | Verdict |
363
+ |---|---|---|
364
+ | Loop alone closes the gap (simplicio unnecessary once you loop) | Q4 − Q3 = **+40 pts** | **rejected** |
365
+ | Simplicio alone is enough (loop is overkill) | Q4 − Q2 = **+20 pts** | **rejected** |
366
+ | Gains stack linearly (no synergy) | Q4 − linear = **−20 pts** | **rejected** |
367
+
368
+ Cost per passing case: Q1 = 4,683 tok / 236s — Q2 = **800 tok / 21s** — Q3 = 3,135 tok / 109s — Q4 = **1,018 tok / 20s**. Full table + charts in [`bench/results_4quadrant.md`](bench/results_4quadrant.md).
369
+
370
+ #### Run 2 — wider multi-model, 3 models × 10 cases (partial), max_iters=5 (2026-05-26)
371
+
372
+ Replicated the matrix across more models and more cases. The wide run was killed mid-execution, so `qwen-2.5-7b` covers only the first 5 of 10 cases and `claude-3.5-haiku` was never reached. The aggregate counts every observed `(model × case × quadrant)` tuple as one observation:
373
+
374
+ | Quadrant | Prompt | Execution | Pass rate | Avg iters | Tokens / pass | ms / pass |
375
+ |---|---|---|---|---|---|---|
376
+ | **Q1** | raw goal | 1-shot | **0/25 (0%)** | 1.00 | 22,387 | 817,437 |
377
+ | **Q2** | simplicio 6-layer | 1-shot | **16/25 (64%)** | 1.00 | 1,093 | 14,797 |
378
+ | **Q3** | raw goal | loop w/ feedback | **11/25 (44%)** | 4.00 | 7,154 | 106,382 |
379
+ | **Q4** | simplicio 6-layer | loop w/ feedback | **19/25 (76%)** | 2.44 | 1,914 | 24,170 |
380
+
381
+ Per-model breakdown:
382
+
383
+ | Model | Cases | Q1 | Q2 | Q3 | Q4 |
384
+ |---|---|---|---|---|---|
385
+ | `google/gemma-3-4b-it` | 10/10 | 0/10 (0%) | 7/10 (70%) | 4/10 (40%) | **8/10 (80%)** |
386
+ | `meta-llama/llama-3.2-3b-instruct` | 10/10 | 0/10 (0%) | 5/10 (50%) | 4/10 (40%) | **6/10 (60%)** |
387
+ | `qwen/qwen-2.5-7b-instruct` | 5/10 | 0/5 (0%) | 4/5 (80%) | 3/5 (60%) | **5/5 (100%)** |
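
The aggregate row follows from pooling the per-model counts, so that each observed case carries equal weight. Here is a short sketch using the numbers from the per-model table; the function name `pooled_pass_rates` is illustrative.

```python
from typing import Dict

# Passing cases per quadrant, copied from the per-model breakdown.
per_model: Dict[str, Dict[str, int]] = {
    "google/gemma-3-4b-it": {"cases": 10, "Q1": 0, "Q2": 7, "Q3": 4, "Q4": 8},
    "meta-llama/llama-3.2-3b-instruct": {"cases": 10, "Q1": 0, "Q2": 5, "Q3": 4, "Q4": 6},
    "qwen/qwen-2.5-7b-instruct": {"cases": 5, "Q1": 0, "Q2": 4, "Q3": 3, "Q4": 5},
}


def pooled_pass_rates(rows: Dict[str, Dict[str, int]]) -> Dict[str, int]:
    """Sum passes and cases across models, then convert to whole percentages."""
    total_cases = sum(row["cases"] for row in rows.values())
    return {
        quadrant: round(100 * sum(row[quadrant] for row in rows.values()) / total_cases)
        for quadrant in ("Q1", "Q2", "Q3", "Q4")
    }


pooled = pooled_pass_rates(per_model)
print(pooled)
```

Pooling differs from averaging the three per-model percentages: the partial `qwen` run contributes 5 observations instead of 10, so it weighs half as much as the other two models.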
388
+
389
+ Decomposition (rejection threshold `|Δ| ≥ 5 pts`):
390
+
391
+ | Hypothesis | Δ | Verdict |
392
+ |---|---|---|
393
+ | Loop alone closes the gap (simplicio unnecessary once you loop) | Q4 − Q3 = **+32 pts** | **rejected** |
394
+ | Simplicio alone is enough (loop is overkill) | Q4 − Q2 = **+12 pts** | **rejected** |
395
+ | Gains stack linearly (no synergy) | Q4 − linear = **−32 pts** | **rejected** |
396
+
397
+ Same picture at every scale: Q4 (composition) wins on pass-rate, **and** Q4 stays close to Q2 on cost (1.9k tok / 24s per pass vs. Q2's 1.1k / 15s) while Q3 burns 7.2k tok / 106s per pass for fewer passes. Full table + per-case breakdown in [`bench/results_4quadrant_wide.md`](bench/results_4quadrant_wide.md).
398
+
399
+ ---
400
+
401
+ ## Plug points (stubs marked in code)
402
+
403
+ | File | Replace with |
404
+ |---|---|
405
+ | `prompt.py::_mapper` | your real **llm-project-mapper** |
406
+ | `pipeline.py::_aplicar_e_testar` | extract diff → `git apply` → parse test result |
407
+ | `skill_router.py` | point `SIMPLICIO_SKILLS_DIR` at your mapper's skills |
408
+
409
+ ## Layout
410
+
411
+ ```
412
+ simplicio/
413
+ cli.py # index | task | bench | smoke
414
+ cache.py # content-hash embedding cache
415
+ precedent.py # grep + semantic rank (uses cache)
416
+ skill_router.py # picks the ONE matching skill
417
+ prompt.py # stacks the 6 layers
418
+ providers.py # any OpenAI-compatible endpoint + Anthropic native
419
+ pipeline.py # generate → test → fix loop
420
+ bench.py # with-vs-without harness
421
+ templates/simplicio_prompt.md
422
+ bench/
423
+ run_offline.py # stdlib-only multi-model benchmark
424
+ cases.json # your benchmark tasks
425
+ cases_offline.json
426
+ results.md # filled by `simplicio bench` / `run_offline.py`
427
+ charts/ # SVG: overall, delta, by_case, by_stack
428
+ ```
429
+
430
+ ## License
431
+ MIT