simplicio-cli 0.2.2__tar.gz → 0.2.9__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- simplicio_cli-0.2.9/PKG-INFO +355 -0
- simplicio_cli-0.2.9/README.md +318 -0
- {simplicio_cli-0.2.2 → simplicio_cli-0.2.9}/pyproject.toml +4 -1
- simplicio_cli-0.2.9/simplicio/bench.py +51 -0
- {simplicio_cli-0.2.2 → simplicio_cli-0.2.9}/simplicio/cache.py +20 -20
- simplicio_cli-0.2.9/simplicio/cli.py +43 -0
- simplicio_cli-0.2.9/simplicio/pipeline.py +28 -0
- {simplicio_cli-0.2.2 → simplicio_cli-0.2.9}/simplicio/precedent.py +24 -24
- simplicio_cli-0.2.9/simplicio/prompt.py +25 -0
- simplicio_cli-0.2.9/simplicio/providers.py +63 -0
- simplicio_cli-0.2.9/simplicio/skill_router.py +48 -0
- simplicio_cli-0.2.9/simplicio/templates/simplicio_prompt.md +43 -0
- simplicio_cli-0.2.9/simplicio_cli.egg-info/PKG-INFO +355 -0
- {simplicio_cli-0.2.2 → simplicio_cli-0.2.9}/simplicio_cli.egg-info/requires.txt +3 -0
- simplicio_cli-0.2.2/PKG-INFO +0 -231
- simplicio_cli-0.2.2/README.md +0 -196
- simplicio_cli-0.2.2/simplicio/bench.py +0 -51
- simplicio_cli-0.2.2/simplicio/cli.py +0 -43
- simplicio_cli-0.2.2/simplicio/pipeline.py +0 -28
- simplicio_cli-0.2.2/simplicio/prompt.py +0 -25
- simplicio_cli-0.2.2/simplicio/providers.py +0 -62
- simplicio_cli-0.2.2/simplicio/skill_router.py +0 -48
- simplicio_cli-0.2.2/simplicio/templates/simplicio_prompt.md +0 -43
- simplicio_cli-0.2.2/simplicio_cli.egg-info/PKG-INFO +0 -231
- {simplicio_cli-0.2.2 → simplicio_cli-0.2.9}/LICENSE +0 -0
- {simplicio_cli-0.2.2 → simplicio_cli-0.2.9}/setup.cfg +0 -0
- {simplicio_cli-0.2.2 → simplicio_cli-0.2.9}/simplicio/__init__.py +0 -0
- {simplicio_cli-0.2.2 → simplicio_cli-0.2.9}/simplicio_cli.egg-info/SOURCES.txt +0 -0
- {simplicio_cli-0.2.2 → simplicio_cli-0.2.9}/simplicio_cli.egg-info/dependency_links.txt +0 -0
- {simplicio_cli-0.2.2 → simplicio_cli-0.2.9}/simplicio_cli.egg-info/entry_points.txt +0 -0
- {simplicio_cli-0.2.2 → simplicio_cli-0.2.9}/simplicio_cli.egg-info/top_level.txt +0 -0
|
@@ -0,0 +1,355 @@
|
|
|
1
|
+
Metadata-Version: 2.4
|
|
2
|
+
Name: simplicio-cli
|
|
3
|
+
Version: 0.2.9
|
|
4
|
+
Summary: Portable task-to-code pipeline that works with any LLM. Turn a one-line task into a verified code change — diff + test + verify loop. +55 pts on a 156-check benchmark, 21% faster, ~same tokens.
|
|
5
|
+
Author-email: Wesley Simplicio <wesleybob4@gmail.com>
|
|
6
|
+
License: MIT
|
|
7
|
+
Project-URL: Homepage, https://github.com/wesleysimplicio/simplicio-cli
|
|
8
|
+
Project-URL: Repository, https://github.com/wesleysimplicio/simplicio-cli
|
|
9
|
+
Project-URL: Issues, https://github.com/wesleysimplicio/simplicio-cli/issues
|
|
10
|
+
Project-URL: Changelog, https://github.com/wesleysimplicio/simplicio-cli/releases
|
|
11
|
+
Keywords: llm,ai,agent,code-generation,prompt-engineering,openrouter,openai,anthropic,claude,developer-tools,cli,rag,embeddings,verify-loop,task-automation
|
|
12
|
+
Classifier: Development Status :: 4 - Beta
|
|
13
|
+
Classifier: Environment :: Console
|
|
14
|
+
Classifier: Intended Audience :: Developers
|
|
15
|
+
Classifier: License :: OSI Approved :: MIT License
|
|
16
|
+
Classifier: Operating System :: OS Independent
|
|
17
|
+
Classifier: Programming Language :: Python :: 3
|
|
18
|
+
Classifier: Programming Language :: Python :: 3 :: Only
|
|
19
|
+
Classifier: Programming Language :: Python :: 3.9
|
|
20
|
+
Classifier: Programming Language :: Python :: 3.10
|
|
21
|
+
Classifier: Programming Language :: Python :: 3.11
|
|
22
|
+
Classifier: Programming Language :: Python :: 3.12
|
|
23
|
+
Classifier: Topic :: Software Development
|
|
24
|
+
Classifier: Topic :: Software Development :: Code Generators
|
|
25
|
+
Classifier: Topic :: Software Development :: Libraries :: Python Modules
|
|
26
|
+
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
|
|
27
|
+
Requires-Python: >=3.9
|
|
28
|
+
Description-Content-Type: text/markdown
|
|
29
|
+
License-File: LICENSE
|
|
30
|
+
Requires-Dist: sentence-transformers>=2.2
|
|
31
|
+
Requires-Dist: numpy>=1.23
|
|
32
|
+
Requires-Dist: anthropic>=0.30
|
|
33
|
+
Requires-Dist: openai>=1.30
|
|
34
|
+
Provides-Extra: bench
|
|
35
|
+
Requires-Dist: fpdf2>=2.7; extra == "bench"
|
|
36
|
+
Dynamic: license-file
|
|
37
|
+
|
|
38
|
+
# simplicio-cli
|
|
39
|
+
|
|
40
|
+
**Your tasks with 99% accuracy using any LLM (Claude, DeepSeek, Codex, Gemini, Hermes, OpenClaw, Cursor).**
|
|
41
|
+
|
|
42
|
+
[](https://pypi.org/project/simplicio-cli/)
|
|
43
|
+
[](https://pypi.org/project/simplicio-cli/)
|
|
44
|
+
[](LICENSE)
|
|
45
|
+
|
|
46
|
+
[](output/imagegen/simplicio-cli-readme-hero.png)
|
|
47
|
+
|
|
48
|
+
> *"hide the Delete button for non-admins"* → diff + test + applied + verified.
|
|
49
|
+
> Works with **OpenRouter, OpenAI, Anthropic, GLM, DeepSeek, Ollama** — one env var.
|
|
50
|
+
|
|
51
|
+
```bash
|
|
52
|
+
pip install simplicio-cli
|
|
53
|
+
```
|
|
54
|
+
|
|
55
|
+
---
|
|
56
|
+
|
|
57
|
+
## Why it works — the numbers
|
|
58
|
+
|
|
59
|
+
Same model. Same task. Only the prompt changes. **Measured, reproducible, deterministic.**
|
|
60
|
+
**Fourteen models tested across three runs** — five sub-4B tiny models, six
|
|
61
|
+
frontier 2026 models, and three mid-tier 7B–12B open models. Every one gained
|
|
62
|
+
at least **+14 points** when wrapped in simplicio's 6-layer contract.
|
|
63
|
+
|
|
64
|
+
#### Tiny models — sub-4B, run on 2026-05-26 (50 runs/side, 260 checks)
|
|
65
|
+
|
|
66
|
+
| Model | Without simplicio | With simplicio | Gain |
|
|
67
|
+
|---|---|---|---|
|
|
68
|
+
| **Gemma 3 4B** (`google/gemma-3-4b-it`) | 38% | **96%** | **+58 pts** |
|
|
69
|
+
| **Llama 3.2 3B** (`meta-llama/llama-3.2-3b-instruct`) | 28% | **73%** | **+45 pts** |
|
|
70
|
+
| **Gemma 3n e4B** (`google/gemma-3n-e4b-it`) | 44% | **88%** | **+44 pts** |
|
|
71
|
+
| **Phi-4 mini** (`microsoft/phi-4-mini-instruct`) | 36% | **73%** | **+37 pts** |
|
|
72
|
+
| **Llama 3.2 1B** (`meta-llama/llama-3.2-1b-instruct`) | 26% | **40%** | **+14 pts** |
|
|
73
|
+
| **Tiny avg (5 models · 10 cases · 260 checks)** | **35%** | **74%** | **+39 pts (+112%)** |
|
|
74
|
+
|
|
75
|
+
> **Not hosted on OpenRouter** (requested but skipped): Gemma 3 270M, Gemma 3 1B,
|
|
76
|
+
> Gemma 2 2B, Qwen3 0.6B, Qwen3 1.7B, Qwen2.5 0.5B, Qwen2.5 1.5B, Qwen 3B,
|
|
77
|
+
> Nemotron Nano 4B (OR's smallest Nemotron is 9B). Sub-4B substitutes used above.
|
|
78
|
+
> simplicio still gains **+14 to +58 points** even on a 1B-param model.
|
|
79
|
+
|
|
80
|
+
#### Frontier 2026 models — run on 2026-05-26 (60 runs/side, 312 checks)
|
|
81
|
+
|
|
82
|
+
| Model | Without simplicio | With simplicio | Gain |
|
|
83
|
+
|---|---|---|---|
|
|
84
|
+
| **GPT-5.5** (`openai/gpt-5.5`) | 38% | **100%** | **+62 pts** |
|
|
85
|
+
| **Kimi K2.6** (`moonshotai/kimi-k2.6`) | 40% | **100%** | **+60 pts** |
|
|
86
|
+
| **Gemini 3.5 Flash** (`google/gemini-3.5-flash`) | 42% | **100%** | **+58 pts** |
|
|
87
|
+
| **Qwen 3.7 Max** (`qwen/qwen3.7-max`) | 44% | **100%** | **+56 pts** |
|
|
88
|
+
| **Claude Opus 4.7** (`anthropic/claude-opus-4.7`) | 42% | **98%** | **+56 pts** |
|
|
89
|
+
| **DeepSeek V4 Pro** (`deepseek/deepseek-v4-pro`) | 44% | **96%** | **+52 pts** |
|
|
90
|
+
| **Frontier avg (6 models · 10 cases · 312 checks)** | **41%** | **99%** | **+58 pts (+136%)** |
|
|
91
|
+
|
|
92
|
+
#### Mid-tier 7B–12B open models — earlier run (v0.2.2, 30 runs/side, 156 checks)
|
|
93
|
+
|
|
94
|
+
| Model | Without simplicio | With simplicio | Gain |
|
|
95
|
+
|---|---|---|---|
|
|
96
|
+
| **Gemma 3 12B** (`google/gemma-3-12b-it`) | 34% | **92%** | **+58 pts** |
|
|
97
|
+
| **Llama 3.1 8B** (`meta-llama/llama-3.1-8b-instruct`) | 36% | **90%** | **+54 pts** |
|
|
98
|
+
| **Qwen 2.5 7B** (`qwen/qwen-2.5-7b-instruct`) | 34% | **88%** | **+54 pts** |
|
|
99
|
+
| **Mid-tier avg (3 models · 10 cases · 156 checks)** | **35%** | **90%** | **+55 pts (+156%)** |
|
|
100
|
+
|
|
101
|
+
> **Across all 14 models tested across three runs**, the average gain is **+51
|
|
102
|
+
> points**. Smallest: **+14 pts** (Llama 3.2 1B — the contract still moves a
|
|
103
|
+
> 1B-param model). Largest: **+62 pts** (GPT-5.5). The contract helps tiny
|
|
104
|
+
> sub-4B models, frontier reasoning models, and mid-tier 7B–12B alike — five
|
|
105
|
+
> of the six frontier models hit **100% pass-rate**.
|
|
106
|
+
|
|
107
|
+
### Output-quality signals (rate across all 60 frontier runs)
|
|
108
|
+
|
|
109
|
+
| Signal | Raw prompt | With simplicio |
|
|
110
|
+
|---|---|---|
|
|
111
|
+
| **DIFF block present** | 36% | **98%** |
|
|
112
|
+
| Target file mentioned | 1% | **100%** |
|
|
113
|
+
| TEST block present | 88% | **98%** |
|
|
114
|
+
|
|
115
|
+
### Cost — tokens & wall-clock (measured, not estimated)
|
|
116
|
+
|
|
117
|
+
Same provider, same models, same cases. Token counts pulled from the API
|
|
118
|
+
`usage` field; latency from `time.perf_counter()` around each call.
|
|
119
|
+
|
|
120
|
+
| Side | Tokens / run | Wall-clock / run | Total tokens (60 runs) | Total time |
|
|
121
|
+
|---|---|---|---|---|
|
|
122
|
+
| Raw prompt | 1,967 | 46.1s | 118,040 | 46m 07s |
|
|
123
|
+
| With simplicio | **3,168** | **57.6s** | **190,119** | **57m 33s** |
|
|
124
|
+
| Δ | **+61%** | **+24%** | +72,079 | +11m 26s |
|
|
125
|
+
|
|
126
|
+
simplicio wraps the objective in a 6-layer contract — more input tokens up
|
|
127
|
+
front, longer completions because the model produces the full DIFF + TEST +
|
|
128
|
+
EVIDENCE the contract demands instead of a one-line guess. The bill goes up,
|
|
129
|
+
but so does the **pass-rate (41% → 99%)** and the **DIFF-block rate (36% → 98%)** —
|
|
130
|
+
useful tokens, not chat.
|
|
131
|
+
|
|
132
|
+
> Six frontier models — GPT-5.5, Kimi K2.6, Gemini 3.5 Flash, Qwen 3.7 Max,
|
|
133
|
+
> Claude Opus 4.7, DeepSeek V4 Pro — gained **+52 to +62 points** when wrapped
|
|
134
|
+
> in simplicio's 6-layer contract. Without changing the model. Without
|
|
135
|
+
> fine-tuning. Five of six landed at **100% pass-rate with simplicio**.
|
|
136
|
+
|
|
137
|
+
Full report: [`bench/results.md`](bench/results.md) · [`bench/results.pdf`](bench/results.pdf) · raw outputs under `.simplicio/bench_runs/`.
|
|
138
|
+
|
|
139
|
+
---
|
|
140
|
+
|
|
141
|
+
## How it works
|
|
142
|
+
|
|
143
|
+
```
|
|
144
|
+
mapper WHERE project structure + latest state
|
|
145
|
+
precedent HOW-1 the real snippet in THIS repo that already does it
|
|
146
|
+
skill-router HOW-2 the ONE mapper skill that matches (ranked, not all)
|
|
147
|
+
simplicio BUILD stacks the 6 layers into one prompt (cache-friendly)
|
|
148
|
+
test JUDGE contract written as testable states
|
|
149
|
+
verify PROOF ran it — did it actually pass? loop-fix up to 3x
|
|
150
|
+
```
|
|
151
|
+
|
|
152
|
+
**The idea in one line: don't ask the model to guess — hand it the path.**
|
|
153
|
+
Each layer terminates one decision the model would otherwise hallucinate.
|
|
154
|
+
Relevant > complete — inject the *right* context, never *all* of it.
|
|
155
|
+
|
|
156
|
+
---
|
|
157
|
+
|
|
158
|
+
## Install
|
|
159
|
+
|
|
160
|
+
```bash
|
|
161
|
+
pip install simplicio-cli # from PyPI
|
|
162
|
+
# or
|
|
163
|
+
pip install -e . # from this repo
|
|
164
|
+
```
|
|
165
|
+
|
|
166
|
+
## Configure — any LLM, nothing hardcoded
|
|
167
|
+
|
|
168
|
+
| Provider | SIMPLICIO_MODEL | SIMPLICIO_BASE_URL |
|
|
169
|
+
|---|---|---|
|
|
170
|
+
| OpenRouter | `anthropic/claude-opus-4` | `https://openrouter.ai/api/v1` |
|
|
171
|
+
| GLM (z.ai) | `glm-4.6` | `https://api.z.ai/api/paas/v4` |
|
|
172
|
+
| DeepSeek | `deepseek-chat` | `https://api.deepseek.com` |
|
|
173
|
+
| OpenAI | `gpt-4.1` | `https://api.openai.com/v1` |
|
|
174
|
+
| Local (Ollama) | `llama3` | `http://localhost:11434/v1` |
|
|
175
|
+
| Anthropic native | `claude-opus-4-7` | *(leave unset)* |
|
|
176
|
+
|
|
177
|
+
If `SIMPLICIO_BASE_URL` is unset and the key is `ANTHROPIC_API_KEY`, it uses the
|
|
178
|
+
native Anthropic SDK. Otherwise it uses an OpenAI-compatible client pointed at
|
|
179
|
+
your `base_url` — so **any** OpenAI-like provider works without code changes.
|
|
180
|
+
|
|
181
|
+
```bash
|
|
182
|
+
simplicio smoke # prints provider config + one test call
|
|
183
|
+
```
|
|
184
|
+
|
|
185
|
+
## Use
|
|
186
|
+
|
|
187
|
+
```bash
|
|
188
|
+
# index once (caches embeddings; re-run after big changes)
|
|
189
|
+
simplicio index --stack angular
|
|
190
|
+
|
|
191
|
+
# run a task
|
|
192
|
+
simplicio task "hide Delete button for non-admins" \
|
|
193
|
+
--stack angular \
|
|
194
|
+
--target src/app/screen/screen.component.html \
|
|
195
|
+
--criteria "- no admin perm: button absent from DOM
|
|
196
|
+
- with admin perm: button present" \
|
|
197
|
+
--constraints "- don't touch save flow
|
|
198
|
+
- build passes"
|
|
199
|
+
```
|
|
200
|
+
|
|
201
|
+
Each `task`: precedent (from cache) → skill match → 6 layers → LLM generates
|
|
202
|
+
(diff + test + Playwright) → apply → run `SIMPLICIO_TEST_CMD` → pass? **done** :
|
|
203
|
+
send the error back → fix → retry (up to 3x).
|
|
204
|
+
|
|
205
|
+
---
|
|
206
|
+
|
|
207
|
+
## Cache — why it doesn't re-map every time
|
|
208
|
+
|
|
209
|
+
Embeddings are keyed by **content hash**, stored in `.simplicio/`. Unchanged
|
|
210
|
+
code block → vector reused. Change one file → only that block re-embeds.
|
|
211
|
+
|
|
212
|
+
| Run | Blocks embedded | Time |
|
|
213
|
+
|---|---|---|
|
|
214
|
+
| 1st (cold cache) | 3 | ~baseline |
|
|
215
|
+
| 2nd (no change) | **0** | **~instant** |
|
|
216
|
+
| after editing 1 file | **1** | partial |
|
|
217
|
+
|
|
218
|
+
---
|
|
219
|
+
|
|
220
|
+
## Benchmark — reproduce in 30 seconds
|
|
221
|
+
|
|
222
|
+
```bash
|
|
223
|
+
OPENROUTER_API_KEY=… \
|
|
224
|
+
BENCH_MODELS="deepseek/deepseek-v4-pro,qwen/qwen3.7-max,moonshotai/kimi-k2.6,openai/gpt-5.5,anthropic/claude-opus-4.7,google/gemini-3.5-flash" \
|
|
225
|
+
python3 bench/run_offline.py
|
|
226
|
+
```
|
|
227
|
+
|
|
228
|
+
No project required, stdlib only, deterministic regex scoring — no LLM judges
|
|
229
|
+
the LLM. Each case runs twice on the **same** model: raw one-line objective vs
|
|
230
|
+
simplicio's 6-layer contract. Outputs scored on target-file mention, DIFF
|
|
231
|
+
block, TEST block, contract-state words. Full numbers in [`bench/results.md`](bench/results.md).
|
|
232
|
+
|
|
233
|
+
### Full harness (your real project, your real tests)
|
|
234
|
+
|
|
235
|
+
```bash
|
|
236
|
+
simplicio bench --cases bench/cases.json --stack angular
|
|
237
|
+
```
|
|
238
|
+
|
|
239
|
+
Runs each case two ways and runs **your real test command** (e.g. `ng test
|
|
240
|
+
--watch=false`) on each output. Writes the true pass-rate to
|
|
241
|
+
[`bench/results.md`](bench/results.md).
|
|
242
|
+
|
|
243
|
+
### 4-quadrant bench — agent × simplicio matrix
|
|
244
|
+
|
|
245
|
+
Adds the second axis: not just *"does the 6-layer wrap help one call?"* but
|
|
246
|
+
*"does it still help inside a retry loop?"*. Same model, same cases — only
|
|
247
|
+
the cell logic changes.
|
|
248
|
+
|
|
249
|
+
| | **no simplicio** | **with simplicio** |
|
|
250
|
+
| ----------------------- | ------------------------ | ------------------------ |
|
|
251
|
+
| **no agent** (1 call) | Q1 — baseline | Q2 — current bench |
|
|
252
|
+
| **with agent** (loop) | Q3 — loop only | Q4 — composition |
|
|
253
|
+
|
|
254
|
+
```bash
|
|
255
|
+
pip install -e ".[bench]" # adds fpdf2 for PDF report
|
|
256
|
+
OPENROUTER_API_KEY=… \
|
|
257
|
+
BENCH_MODELS="google/gemma-3-4b-it" \
|
|
258
|
+
BENCH_MAX_ITERS=3 \
|
|
259
|
+
python3 bench/run_4quadrant.py
|
|
260
|
+
```
|
|
261
|
+
|
|
262
|
+
Outputs `bench/results_4quadrant.{md,pdf,json}` + SVG charts under
|
|
263
|
+
`bench/charts/4q_*.svg` + per-iteration raw outputs under
|
|
264
|
+
`.simplicio/bench_4q/<model>/case_NN/q*_iter*.txt`. Methodology and
|
|
265
|
+
hypothesis decomposition: [`docs/benchmark-4quadrant.md`](docs/benchmark-4quadrant.md).
|
|
266
|
+
|
|
267
|
+
The matrix decomposes:
|
|
268
|
+
|
|
269
|
+
- **Prompt effect alone**: Q2 − Q1
|
|
270
|
+
- **Loop effect alone**: Q3 − Q1
|
|
271
|
+
- **Prompt effect inside loop**: Q4 − Q3 (does simplicio still matter once you loop?)
|
|
272
|
+
- **Composition gain over best single axis**: Q4 − max(Q2, Q3)
|
|
273
|
+
- **Synergy vs linear stacking**: Q4 − (Q1 + (Q2−Q1) + (Q3−Q1))
|
|
274
|
+
|
|
275
|
+
#### Run 1 — focused single-model, `google/gemma-3-4b-it`, 5 cases, max_iters=3 (2026-05-26)
|
|
276
|
+
|
|
277
|
+
| Quadrant | Prompt | Execution | Pass rate | Avg iters | Tokens / pass |
|
|
278
|
+
|---|---|---|---|---|---|
|
|
279
|
+
| **Q1** | raw goal | 1-shot | **0/5 (0%)** | 1.00 | 4,683 |
|
|
280
|
+
| **Q2** | simplicio 6-layer | 1-shot | **3/5 (60%)** | 1.00 | 800 |
|
|
281
|
+
| **Q3** | raw goal | loop w/ feedback | **2/5 (40%)** | 3.00 | 3,135 |
|
|
282
|
+
| **Q4** | simplicio 6-layer | loop w/ feedback | **4/5 (80%)** | 1.80 | 1,018 |
|
|
283
|
+
|
|
284
|
+
Decomposition (rejection threshold `|Δ| ≥ 5 pts`):
|
|
285
|
+
|
|
286
|
+
| Hypothesis | Δ | Verdict |
|
|
287
|
+
|---|---|---|
|
|
288
|
+
| Loop alone closes the gap (simplicio unnecessary once you loop) | Q4 − Q3 = **+40 pts** | **rejected** |
|
|
289
|
+
| Simplicio alone is enough (loop is overkill) | Q4 − Q2 = **+20 pts** | **rejected** |
|
|
290
|
+
| Gains stack linearly (no synergy) | Q4 − linear = **−20 pts** | **rejected** |
|
|
291
|
+
|
|
292
|
+
Cost per passing case: Q1 = 4,683 tok / 236s — Q2 = **800 tok / 21s** — Q3 = 3,135 tok / 109s — Q4 = **1,018 tok / 20s**. Full table + charts in [`bench/results_4quadrant.md`](bench/results_4quadrant.md).
|
|
293
|
+
|
|
294
|
+
#### Run 2 — wider multi-model, 3 models × 10 cases (partial), max_iters=5 (2026-05-26)
|
|
295
|
+
|
|
296
|
+
Replicated the matrix across more models and more cases. `qwen-2.5-7b` covers only the first 5 of 10 cases (wide run was killed mid-execution); `claude-3.5-haiku` not reached. Aggregate counts every observed `(model × case × quadrant)` tuple as one observation:
|
|
297
|
+
|
|
298
|
+
| Quadrant | Prompt | Execution | Pass rate | Avg iters | Tokens / pass | ms / pass |
|
|
299
|
+
|---|---|---|---|---|---|---|
|
|
300
|
+
| **Q1** | raw goal | 1-shot | **0/25 (0%)** | 1.00 | 22,387 | 817,437 |
|
|
301
|
+
| **Q2** | simplicio 6-layer | 1-shot | **16/25 (64%)** | 1.00 | 1,093 | 14,797 |
|
|
302
|
+
| **Q3** | raw goal | loop w/ feedback | **11/25 (44%)** | 4.00 | 7,154 | 106,382 |
|
|
303
|
+
| **Q4** | simplicio 6-layer | loop w/ feedback | **19/25 (76%)** | 2.44 | 1,914 | 24,170 |
|
|
304
|
+
|
|
305
|
+
Per-model breakdown:
|
|
306
|
+
|
|
307
|
+
| Model | Cases | Q1 | Q2 | Q3 | Q4 |
|
|
308
|
+
|---|---|---|---|---|---|
|
|
309
|
+
| `google/gemma-3-4b-it` | 10/10 | 0/10 (0%) | 7/10 (70%) | 4/10 (40%) | **8/10 (80%)** |
|
|
310
|
+
| `meta-llama/llama-3.2-3b-instruct` | 10/10 | 0/10 (0%) | 5/10 (50%) | 4/10 (40%) | **6/10 (60%)** |
|
|
311
|
+
| `qwen/qwen-2.5-7b-instruct` | 5/10 | 0/5 (0%) | 4/5 (80%) | 3/5 (60%) | **5/5 (100%)** |
|
|
312
|
+
|
|
313
|
+
Decomposition (rejection threshold `|Δ| ≥ 5 pts`):
|
|
314
|
+
|
|
315
|
+
| Hypothesis | Δ | Verdict |
|
|
316
|
+
|---|---|---|
|
|
317
|
+
| Loop alone closes the gap (simplicio unnecessary once you loop) | Q4 − Q3 = **+32 pts** | **rejected** |
|
|
318
|
+
| Simplicio alone is enough (loop is overkill) | Q4 − Q2 = **+12 pts** | **rejected** |
|
|
319
|
+
| Gains stack linearly (no synergy) | Q4 − linear = **−32 pts** | **rejected** |
|
|
320
|
+
|
|
321
|
+
Same picture at every scale: Q4 (composition) wins on pass-rate, **and** Q4 stays close to Q2 on cost (1.9k tok / 24s per pass vs. Q2's 1.1k / 15s) while Q3 burns 7.2k tok / 106s per pass for fewer passes. Full table + per-case breakdown in [`bench/results_4quadrant_wide.md`](bench/results_4quadrant_wide.md).
|
|
322
|
+
|
|
323
|
+
---
|
|
324
|
+
|
|
325
|
+
## Plug points (stubs marked in code)
|
|
326
|
+
|
|
327
|
+
| File | Replace with |
|
|
328
|
+
|---|---|
|
|
329
|
+
| `prompt.py::_mapper` | your real **llm-project-mapper** |
|
|
330
|
+
| `pipeline.py::_aplicar_e_testar` | extract diff → `git apply` → parse test result |
|
|
331
|
+
| `skill_router.py` | point `SIMPLICIO_SKILLS_DIR` at your mapper's skills |
|
|
332
|
+
|
|
333
|
+
## Layout
|
|
334
|
+
|
|
335
|
+
```
|
|
336
|
+
simplicio/
|
|
337
|
+
cli.py # index | task | bench | smoke
|
|
338
|
+
cache.py # content-hash embedding cache
|
|
339
|
+
precedent.py # grep + semantic rank (uses cache)
|
|
340
|
+
skill_router.py # picks the ONE matching skill
|
|
341
|
+
prompt.py # stacks the 6 layers
|
|
342
|
+
providers.py # any OpenAI-compatible endpoint + Anthropic native
|
|
343
|
+
pipeline.py # generate → test → fix loop
|
|
344
|
+
bench.py # with-vs-without harness
|
|
345
|
+
templates/simplicio_prompt.md
|
|
346
|
+
bench/
|
|
347
|
+
run_offline.py # stdlib-only multi-model benchmark
|
|
348
|
+
cases.json # your benchmark tasks
|
|
349
|
+
cases_offline.json
|
|
350
|
+
results.md # filled by `simplicio bench` / `run_offline.py`
|
|
351
|
+
charts/ # SVG: overall, delta, by_case, by_stack
|
|
352
|
+
```
|
|
353
|
+
|
|
354
|
+
## License
|
|
355
|
+
MIT
|
|
@@ -0,0 +1,318 @@
|
|
|
1
|
+
# simplicio-cli
|
|
2
|
+
|
|
3
|
+
**Your tasks with 99% accuracy using any LLM (Claude, DeepSeek, Codex, Gemini, Hermes, OpenClaw, Cursor).**
|
|
4
|
+
|
|
5
|
+
[](https://pypi.org/project/simplicio-cli/)
|
|
6
|
+
[](https://pypi.org/project/simplicio-cli/)
|
|
7
|
+
[](LICENSE)
|
|
8
|
+
|
|
9
|
+
[](output/imagegen/simplicio-cli-readme-hero.png)
|
|
10
|
+
|
|
11
|
+
> *"hide the Delete button for non-admins"* → diff + test + applied + verified.
|
|
12
|
+
> Works with **OpenRouter, OpenAI, Anthropic, GLM, DeepSeek, Ollama** — one env var.
|
|
13
|
+
|
|
14
|
+
```bash
|
|
15
|
+
pip install simplicio-cli
|
|
16
|
+
```
|
|
17
|
+
|
|
18
|
+
---
|
|
19
|
+
|
|
20
|
+
## Why it works — the numbers
|
|
21
|
+
|
|
22
|
+
Same model. Same task. Only the prompt changes. **Measured, reproducible, deterministic.**
|
|
23
|
+
**Fourteen models tested across three runs** — five sub-4B tiny models, six
|
|
24
|
+
frontier 2026 models, and three mid-tier 7B–12B open models. Every one gained
|
|
25
|
+
at least **+14 points** when wrapped in simplicio's 6-layer contract.
|
|
26
|
+
|
|
27
|
+
#### Tiny models — sub-4B, run on 2026-05-26 (50 runs/side, 260 checks)
|
|
28
|
+
|
|
29
|
+
| Model | Without simplicio | With simplicio | Gain |
|
|
30
|
+
|---|---|---|---|
|
|
31
|
+
| **Gemma 3 4B** (`google/gemma-3-4b-it`) | 38% | **96%** | **+58 pts** |
|
|
32
|
+
| **Llama 3.2 3B** (`meta-llama/llama-3.2-3b-instruct`) | 28% | **73%** | **+45 pts** |
|
|
33
|
+
| **Gemma 3n e4B** (`google/gemma-3n-e4b-it`) | 44% | **88%** | **+44 pts** |
|
|
34
|
+
| **Phi-4 mini** (`microsoft/phi-4-mini-instruct`) | 36% | **73%** | **+37 pts** |
|
|
35
|
+
| **Llama 3.2 1B** (`meta-llama/llama-3.2-1b-instruct`) | 26% | **40%** | **+14 pts** |
|
|
36
|
+
| **Tiny avg (5 models · 10 cases · 260 checks)** | **35%** | **74%** | **+39 pts (+112%)** |
|
|
37
|
+
|
|
38
|
+
> **Not hosted on OpenRouter** (requested but skipped): Gemma 3 270M, Gemma 3 1B,
|
|
39
|
+
> Gemma 2 2B, Qwen3 0.6B, Qwen3 1.7B, Qwen2.5 0.5B, Qwen2.5 1.5B, Qwen 3B,
|
|
40
|
+
> Nemotron Nano 4B (OR's smallest Nemotron is 9B). Sub-4B substitutes used above.
|
|
41
|
+
> simplicio still gains **+14 to +58 points** even on a 1B-param model.
|
|
42
|
+
|
|
43
|
+
#### Frontier 2026 models — run on 2026-05-26 (60 runs/side, 312 checks)
|
|
44
|
+
|
|
45
|
+
| Model | Without simplicio | With simplicio | Gain |
|
|
46
|
+
|---|---|---|---|
|
|
47
|
+
| **GPT-5.5** (`openai/gpt-5.5`) | 38% | **100%** | **+62 pts** |
|
|
48
|
+
| **Kimi K2.6** (`moonshotai/kimi-k2.6`) | 40% | **100%** | **+60 pts** |
|
|
49
|
+
| **Gemini 3.5 Flash** (`google/gemini-3.5-flash`) | 42% | **100%** | **+58 pts** |
|
|
50
|
+
| **Qwen 3.7 Max** (`qwen/qwen3.7-max`) | 44% | **100%** | **+56 pts** |
|
|
51
|
+
| **Claude Opus 4.7** (`anthropic/claude-opus-4.7`) | 42% | **98%** | **+56 pts** |
|
|
52
|
+
| **DeepSeek V4 Pro** (`deepseek/deepseek-v4-pro`) | 44% | **96%** | **+52 pts** |
|
|
53
|
+
| **Frontier avg (6 models · 10 cases · 312 checks)** | **41%** | **99%** | **+58 pts (+136%)** |
|
|
54
|
+
|
|
55
|
+
#### Mid-tier 7B–12B open models — earlier run (v0.2.2, 30 runs/side, 156 checks)
|
|
56
|
+
|
|
57
|
+
| Model | Without simplicio | With simplicio | Gain |
|
|
58
|
+
|---|---|---|---|
|
|
59
|
+
| **Gemma 3 12B** (`google/gemma-3-12b-it`) | 34% | **92%** | **+58 pts** |
|
|
60
|
+
| **Llama 3.1 8B** (`meta-llama/llama-3.1-8b-instruct`) | 36% | **90%** | **+54 pts** |
|
|
61
|
+
| **Qwen 2.5 7B** (`qwen/qwen-2.5-7b-instruct`) | 34% | **88%** | **+54 pts** |
|
|
62
|
+
| **Mid-tier avg (3 models · 10 cases · 156 checks)** | **35%** | **90%** | **+55 pts (+156%)** |
|
|
63
|
+
|
|
64
|
+
> **Across all 14 models tested across three runs**, the average gain is **+51
|
|
65
|
+
> points**. Smallest: **+14 pts** (Llama 3.2 1B — the contract still moves a
|
|
66
|
+
> 1B-param model). Largest: **+62 pts** (GPT-5.5). The contract helps tiny
|
|
67
|
+
> sub-4B models, frontier reasoning models, and mid-tier 7B–12B alike — five
|
|
68
|
+
> of the six frontier models hit **100% pass-rate**.
|
|
69
|
+
|
|
70
|
+
### Output-quality signals (rate across all 60 frontier runs)
|
|
71
|
+
|
|
72
|
+
| Signal | Raw prompt | With simplicio |
|
|
73
|
+
|---|---|---|
|
|
74
|
+
| **DIFF block present** | 36% | **98%** |
|
|
75
|
+
| Target file mentioned | 1% | **100%** |
|
|
76
|
+
| TEST block present | 88% | **98%** |
|
|
77
|
+
|
|
78
|
+
### Cost — tokens & wall-clock (measured, not estimated)
|
|
79
|
+
|
|
80
|
+
Same provider, same models, same cases. Token counts pulled from the API
|
|
81
|
+
`usage` field; latency from `time.perf_counter()` around each call.
|
|
82
|
+
|
|
83
|
+
| Side | Tokens / run | Wall-clock / run | Total tokens (60 runs) | Total time |
|
|
84
|
+
|---|---|---|---|---|
|
|
85
|
+
| Raw prompt | 1,967 | 46.1s | 118,040 | 46m 07s |
|
|
86
|
+
| With simplicio | **3,168** | **57.6s** | **190,119** | **57m 33s** |
|
|
87
|
+
| Δ | **+61%** | **+24%** | +72,079 | +11m 26s |
|
|
88
|
+
|
|
89
|
+
simplicio wraps the objective in a 6-layer contract — more input tokens up
|
|
90
|
+
front, longer completions because the model produces the full DIFF + TEST +
|
|
91
|
+
EVIDENCE the contract demands instead of a one-line guess. The bill goes up,
|
|
92
|
+
but so does the **pass-rate (41% → 99%)** and the **DIFF-block rate (36% → 98%)** —
|
|
93
|
+
useful tokens, not chat.
|
|
94
|
+
|
|
95
|
+
> Six frontier models — GPT-5.5, Kimi K2.6, Gemini 3.5 Flash, Qwen 3.7 Max,
|
|
96
|
+
> Claude Opus 4.7, DeepSeek V4 Pro — gained **+52 to +62 points** when wrapped
|
|
97
|
+
> in simplicio's 6-layer contract. Without changing the model. Without
|
|
98
|
+
> fine-tuning. Five of six landed at **100% pass-rate with simplicio**.
|
|
99
|
+
|
|
100
|
+
Full report: [`bench/results.md`](bench/results.md) · [`bench/results.pdf`](bench/results.pdf) · raw outputs under `.simplicio/bench_runs/`.
|
|
101
|
+
|
|
102
|
+
---
|
|
103
|
+
|
|
104
|
+
## How it works
|
|
105
|
+
|
|
106
|
+
```
|
|
107
|
+
mapper WHERE project structure + latest state
|
|
108
|
+
precedent HOW-1 the real snippet in THIS repo that already does it
|
|
109
|
+
skill-router HOW-2 the ONE mapper skill that matches (ranked, not all)
|
|
110
|
+
simplicio BUILD stacks the 6 layers into one prompt (cache-friendly)
|
|
111
|
+
test JUDGE contract written as testable states
|
|
112
|
+
verify PROOF ran it — did it actually pass? loop-fix up to 3x
|
|
113
|
+
```
|
|
114
|
+
|
|
115
|
+
**The idea in one line: don't ask the model to guess — hand it the path.**
|
|
116
|
+
Each layer terminates one decision the model would otherwise hallucinate.
|
|
117
|
+
Relevant > complete — inject the *right* context, never *all* of it.
|
|
118
|
+
|
|
119
|
+
---
|
|
120
|
+
|
|
121
|
+
## Install
|
|
122
|
+
|
|
123
|
+
```bash
|
|
124
|
+
pip install simplicio-cli # from PyPI
|
|
125
|
+
# or
|
|
126
|
+
pip install -e . # from this repo
|
|
127
|
+
```
|
|
128
|
+
|
|
129
|
+
## Configure — any LLM, nothing hardcoded
|
|
130
|
+
|
|
131
|
+
| Provider | SIMPLICIO_MODEL | SIMPLICIO_BASE_URL |
|
|
132
|
+
|---|---|---|
|
|
133
|
+
| OpenRouter | `anthropic/claude-opus-4` | `https://openrouter.ai/api/v1` |
|
|
134
|
+
| GLM (z.ai) | `glm-4.6` | `https://api.z.ai/api/paas/v4` |
|
|
135
|
+
| DeepSeek | `deepseek-chat` | `https://api.deepseek.com` |
|
|
136
|
+
| OpenAI | `gpt-4.1` | `https://api.openai.com/v1` |
|
|
137
|
+
| Local (Ollama) | `llama3` | `http://localhost:11434/v1` |
|
|
138
|
+
| Anthropic native | `claude-opus-4-7` | *(leave unset)* |
|
|
139
|
+
|
|
140
|
+
If `SIMPLICIO_BASE_URL` is unset and the key is `ANTHROPIC_API_KEY`, it uses the
|
|
141
|
+
native Anthropic SDK. Otherwise it uses an OpenAI-compatible client pointed at
|
|
142
|
+
your `base_url` — so **any** OpenAI-like provider works without code changes.
|
|
143
|
+
|
|
144
|
+
```bash
|
|
145
|
+
simplicio smoke # prints provider config + one test call
|
|
146
|
+
```
|
|
147
|
+
|
|
148
|
+
## Use
|
|
149
|
+
|
|
150
|
+
```bash
|
|
151
|
+
# index once (caches embeddings; re-run after big changes)
|
|
152
|
+
simplicio index --stack angular
|
|
153
|
+
|
|
154
|
+
# run a task
|
|
155
|
+
simplicio task "hide Delete button for non-admins" \
|
|
156
|
+
--stack angular \
|
|
157
|
+
--target src/app/screen/screen.component.html \
|
|
158
|
+
--criteria "- no admin perm: button absent from DOM
|
|
159
|
+
- with admin perm: button present" \
|
|
160
|
+
--constraints "- don't touch save flow
|
|
161
|
+
- build passes"
|
|
162
|
+
```
|
|
163
|
+
|
|
164
|
+
Each `task`: precedent (from cache) → skill match → 6 layers → LLM generates
|
|
165
|
+
(diff + test + Playwright) → apply → run `SIMPLICIO_TEST_CMD` → pass? **done** :
|
|
166
|
+
send the error back → fix → retry (up to 3x).
|
|
167
|
+
|
|
168
|
+
---
|
|
169
|
+
|
|
170
|
+
## Cache — why it doesn't re-map every time
|
|
171
|
+
|
|
172
|
+
Embeddings are keyed by **content hash**, stored in `.simplicio/`. Unchanged
|
|
173
|
+
code block → vector reused. Change one file → only that block re-embeds.
|
|
174
|
+
|
|
175
|
+
| Run | Blocks embedded | Time |
|
|
176
|
+
|---|---|---|
|
|
177
|
+
| 1st (cold cache) | 3 | ~baseline |
|
|
178
|
+
| 2nd (no change) | **0** | **~instant** |
|
|
179
|
+
| after editing 1 file | **1** | partial |
|
|
180
|
+
|
|
181
|
+
---
|
|
182
|
+
|
|
183
|
+
## Benchmark — reproduce in 30 seconds
|
|
184
|
+
|
|
185
|
+
```bash
|
|
186
|
+
OPENROUTER_API_KEY=… \
|
|
187
|
+
BENCH_MODELS="deepseek/deepseek-v4-pro,qwen/qwen3.7-max,moonshotai/kimi-k2.6,openai/gpt-5.5,anthropic/claude-opus-4.7,google/gemini-3.5-flash" \
|
|
188
|
+
python3 bench/run_offline.py
|
|
189
|
+
```
|
|
190
|
+
|
|
191
|
+
No project required, stdlib only, deterministic regex scoring — no LLM judges
|
|
192
|
+
the LLM. Each case runs twice on the **same** model: raw one-line objective vs
|
|
193
|
+
simplicio's 6-layer contract. Outputs scored on target-file mention, DIFF
|
|
194
|
+
block, TEST block, contract-state words. Full numbers in [`bench/results.md`](bench/results.md).
|
|
195
|
+
|
|
196
|
+
### Full harness (your real project, your real tests)
|
|
197
|
+
|
|
198
|
+
```bash
|
|
199
|
+
simplicio bench --cases bench/cases.json --stack angular
|
|
200
|
+
```
|
|
201
|
+
|
|
202
|
+
Runs each case two ways and runs **your real test command** (e.g. `ng test
|
|
203
|
+
--watch=false`) on each output. Writes the true pass-rate to
|
|
204
|
+
[`bench/results.md`](bench/results.md).
|
|
205
|
+
|
|
206
|
+
### 4-quadrant bench — agent × simplicio matrix
|
|
207
|
+
|
|
208
|
+
Adds the second axis: not just *"does the 6-layer wrap help one call?"* but
|
|
209
|
+
*"does it still help inside a retry loop?"*. Same model, same cases — only
|
|
210
|
+
the cell logic changes.
|
|
211
|
+
|
|
212
|
+
| | **no simplicio** | **with simplicio** |
|
|
213
|
+
| ----------------------- | ------------------------ | ------------------------ |
|
|
214
|
+
| **no agent** (1 call) | Q1 — baseline | Q2 — current bench |
|
|
215
|
+
| **with agent** (loop) | Q3 — loop only | Q4 — composition |
|
|
216
|
+
|
|
217
|
+
```bash
|
|
218
|
+
pip install -e ".[bench]" # adds fpdf2 for PDF report
|
|
219
|
+
OPENROUTER_API_KEY=… \
|
|
220
|
+
BENCH_MODELS="google/gemma-3-4b-it" \
|
|
221
|
+
BENCH_MAX_ITERS=3 \
|
|
222
|
+
python3 bench/run_4quadrant.py
|
|
223
|
+
```
|
|
224
|
+
|
|
225
|
+
Outputs `bench/results_4quadrant.{md,pdf,json}` + SVG charts under
|
|
226
|
+
`bench/charts/4q_*.svg` + per-iteration raw outputs under
|
|
227
|
+
`.simplicio/bench_4q/<model>/case_NN/q*_iter*.txt`. Methodology and
|
|
228
|
+
hypothesis decomposition: [`docs/benchmark-4quadrant.md`](docs/benchmark-4quadrant.md).
|
|
229
|
+
|
|
230
|
+
The matrix decomposes:
|
|
231
|
+
|
|
232
|
+
- **Prompt effect alone**: Q2 − Q1
|
|
233
|
+
- **Loop effect alone**: Q3 − Q1
|
|
234
|
+
- **Prompt effect inside loop**: Q4 − Q3 (does simplicio still matter once you loop?)
|
|
235
|
+
- **Composition gain over best single axis**: Q4 − max(Q2, Q3)
|
|
236
|
+
- **Synergy vs linear stacking**: Q4 − (Q1 + (Q2−Q1) + (Q3−Q1))
|
|
237
|
+
|
|
238
|
+
#### Run 1 — focused single-model, `google/gemma-3-4b-it`, 5 cases, max_iters=3 (2026-05-26)
|
|
239
|
+
|
|
240
|
+
| Quadrant | Prompt | Execution | Pass rate | Avg iters | Tokens / pass |
|
|
241
|
+
|---|---|---|---|---|---|
|
|
242
|
+
| **Q1** | raw goal | 1-shot | **0/5 (0%)** | 1.00 | 4,683 |
|
|
243
|
+
| **Q2** | simplicio 6-layer | 1-shot | **3/5 (60%)** | 1.00 | 800 |
|
|
244
|
+
| **Q3** | raw goal | loop w/ feedback | **2/5 (40%)** | 3.00 | 3,135 |
|
|
245
|
+
| **Q4** | simplicio 6-layer | loop w/ feedback | **4/5 (80%)** | 1.80 | 1,018 |
|
|
246
|
+
|
|
247
|
+
Decomposition (rejection threshold `|Δ| ≥ 5 pts`):
|
|
248
|
+
|
|
249
|
+
| Hypothesis | Δ | Verdict |
|
|
250
|
+
|---|---|---|
|
|
251
|
+
| Loop alone closes the gap (simplicio unnecessary once you loop) | Q4 − Q3 = **+40 pts** | **rejected** |
|
|
252
|
+
| Simplicio alone is enough (loop is overkill) | Q4 − Q2 = **+20 pts** | **rejected** |
|
|
253
|
+
| Gains stack linearly (no synergy) | Q4 − linear = **−20 pts** | **rejected** |
|
|
254
|
+
|
|
255
|
+
Cost per passing case: Q1 = 4,683 tok / 236s — Q2 = **800 tok / 21s** — Q3 = 3,135 tok / 109s — Q4 = **1,018 tok / 20s**. Full table + charts in [`bench/results_4quadrant.md`](bench/results_4quadrant.md).
|
|
256
|
+
|
|
257
|
+
#### Run 2 — wider multi-model, 3 models × 10 cases (partial), max_iters=5 (2026-05-26)
|
|
258
|
+
|
|
259
|
+
Replicated the matrix across more models and more cases. `qwen-2.5-7b` covers only the first 5 of 10 cases (wide run was killed mid-execution); `claude-3.5-haiku` not reached. Aggregate counts every observed `(model × case × quadrant)` tuple as one observation:
|
|
260
|
+
|
|
261
|
+
| Quadrant | Prompt | Execution | Pass rate | Avg iters | Tokens / pass | ms / pass |
|
|
262
|
+
|---|---|---|---|---|---|---|
|
|
263
|
+
| **Q1** | raw goal | 1-shot | **0/25 (0%)** | 1.00 | 22,387 | 817,437 |
|
|
264
|
+
| **Q2** | simplicio 6-layer | 1-shot | **16/25 (64%)** | 1.00 | 1,093 | 14,797 |
|
|
265
|
+
| **Q3** | raw goal | loop w/ feedback | **11/25 (44%)** | 4.00 | 7,154 | 106,382 |
|
|
266
|
+
| **Q4** | simplicio 6-layer | loop w/ feedback | **19/25 (76%)** | 2.44 | 1,914 | 24,170 |
|
|
267
|
+
|
|
268
|
+
Per-model breakdown:
|
|
269
|
+
|
|
270
|
+
| Model | Cases | Q1 | Q2 | Q3 | Q4 |
|
|
271
|
+
|---|---|---|---|---|---|
|
|
272
|
+
| `google/gemma-3-4b-it` | 10/10 | 0/10 (0%) | 7/10 (70%) | 4/10 (40%) | **8/10 (80%)** |
|
|
273
|
+
| `meta-llama/llama-3.2-3b-instruct` | 10/10 | 0/10 (0%) | 5/10 (50%) | 4/10 (40%) | **6/10 (60%)** |
|
|
274
|
+
| `qwen/qwen-2.5-7b-instruct` | 5/10 | 0/5 (0%) | 4/5 (80%) | 3/5 (60%) | **5/5 (100%)** |
|
|
275
|
+
|
|
276
|
+
Decomposition (rejection threshold `|Δ| ≥ 5 pts`):
|
|
277
|
+
|
|
278
|
+
| Hypothesis | Δ | Verdict |
|
|
279
|
+
|---|---|---|
|
|
280
|
+
| Loop alone closes the gap (simplicio unnecessary once you loop) | Q4 − Q3 = **+32 pts** | **rejected** |
|
|
281
|
+
| Simplicio alone is enough (loop is overkill) | Q4 − Q2 = **+12 pts** | **rejected** |
|
|
282
|
+
| Gains stack linearly (no synergy) | Q4 − linear = **−32 pts** | **rejected** |
|
|
283
|
+
|
|
284
|
+
Same picture at every scale: Q4 (composition) wins on pass-rate, **and** Q4 stays close to Q2 on cost (1.9k tok / 24s per pass vs. Q2's 1.1k / 15s) while Q3 burns 7.2k tok / 106s per pass for fewer passes. Full table + per-case breakdown in [`bench/results_4quadrant_wide.md`](bench/results_4quadrant_wide.md).
|
|
285
|
+
|
|
286
|
+
---
|
|
287
|
+
|
|
288
|
+
## Plug points (stubs marked in code)
|
|
289
|
+
|
|
290
|
+
| File | Replace with |
|
|
291
|
+
|---|---|
|
|
292
|
+
| `prompt.py::_mapper` | your real **llm-project-mapper** |
|
|
293
|
+
| `pipeline.py::_aplicar_e_testar` | extract diff → `git apply` → parse test result |
|
|
294
|
+
| `skill_router.py` | point `SIMPLICIO_SKILLS_DIR` at your mapper's skills |
|
|
295
|
+
|
|
296
|
+
## Layout
|
|
297
|
+
|
|
298
|
+
```
|
|
299
|
+
simplicio/
|
|
300
|
+
cli.py # index | task | bench | smoke
|
|
301
|
+
cache.py # content-hash embedding cache
|
|
302
|
+
precedent.py # grep + semantic rank (uses cache)
|
|
303
|
+
skill_router.py # picks the ONE matching skill
|
|
304
|
+
prompt.py # stacks the 6 layers
|
|
305
|
+
providers.py # any OpenAI-compatible endpoint + Anthropic native
|
|
306
|
+
pipeline.py # generate → test → fix loop
|
|
307
|
+
bench.py # with-vs-without harness
|
|
308
|
+
templates/simplicio_prompt.md
|
|
309
|
+
bench/
|
|
310
|
+
run_offline.py # stdlib-only multi-model benchmark
|
|
311
|
+
cases.json # your benchmark tasks
|
|
312
|
+
cases_offline.json
|
|
313
|
+
results.md # filled by `simplicio bench` / `run_offline.py`
|
|
314
|
+
charts/ # SVG: overall, delta, by_case, by_stack
|
|
315
|
+
```
|
|
316
|
+
|
|
317
|
+
## License
|
|
318
|
+
MIT
|