simplicio-cli 0.4.0__tar.gz → 0.4.2__tar.gz

Files changed (32)
  1. {simplicio_cli-0.4.0/simplicio_cli.egg-info → simplicio_cli-0.4.2}/PKG-INFO +60 -14
  2. {simplicio_cli-0.4.0 → simplicio_cli-0.4.2}/README.md +57 -11
  3. {simplicio_cli-0.4.0 → simplicio_cli-0.4.2}/pyproject.toml +3 -3
  4. {simplicio_cli-0.4.0 → simplicio_cli-0.4.2}/simplicio/cli.py +35 -5
  5. {simplicio_cli-0.4.0 → simplicio_cli-0.4.2}/simplicio/pipeline.py +95 -11
  6. {simplicio_cli-0.4.0 → simplicio_cli-0.4.2/simplicio_cli.egg-info}/PKG-INFO +60 -14
  7. {simplicio_cli-0.4.0 → simplicio_cli-0.4.2}/simplicio_cli.egg-info/requires.txt +2 -2
  8. {simplicio_cli-0.4.0 → simplicio_cli-0.4.2}/LICENSE +0 -0
  9. {simplicio_cli-0.4.0 → simplicio_cli-0.4.2}/setup.cfg +0 -0
  10. {simplicio_cli-0.4.0 → simplicio_cli-0.4.2}/simplicio/__init__.py +0 -0
  11. {simplicio_cli-0.4.0 → simplicio_cli-0.4.2}/simplicio/adaptive.py +0 -0
  12. {simplicio_cli-0.4.0 → simplicio_cli-0.4.2}/simplicio/bench.py +0 -0
  13. {simplicio_cli-0.4.0 → simplicio_cli-0.4.2}/simplicio/cache.py +0 -0
  14. {simplicio_cli-0.4.0 → simplicio_cli-0.4.2}/simplicio/detect.py +0 -0
  15. {simplicio_cli-0.4.0 → simplicio_cli-0.4.2}/simplicio/init.py +0 -0
  16. {simplicio_cli-0.4.0 → simplicio_cli-0.4.2}/simplicio/mapper.py +0 -0
  17. {simplicio_cli-0.4.0 → simplicio_cli-0.4.2}/simplicio/observability.py +0 -0
  18. {simplicio_cli-0.4.0 → simplicio_cli-0.4.2}/simplicio/precedent.py +0 -0
  19. {simplicio_cli-0.4.0 → simplicio_cli-0.4.2}/simplicio/prompt.py +0 -0
  20. {simplicio_cli-0.4.0 → simplicio_cli-0.4.2}/simplicio/providers.py +0 -0
  21. {simplicio_cli-0.4.0 → simplicio_cli-0.4.2}/simplicio/skill_router.py +0 -0
  22. {simplicio_cli-0.4.0 → simplicio_cli-0.4.2}/simplicio/templates/SKILL.md +0 -0
  23. {simplicio_cli-0.4.0 → simplicio_cli-0.4.2}/simplicio/templates/simplicio_prompt.md +0 -0
  24. {simplicio_cli-0.4.0 → simplicio_cli-0.4.2}/simplicio/templates/userpromptsubmit-hook.sh +0 -0
  25. {simplicio_cli-0.4.0 → simplicio_cli-0.4.2}/simplicio/utils/__init__.py +0 -0
  26. {simplicio_cli-0.4.0 → simplicio_cli-0.4.2}/simplicio/utils/cache.py +0 -0
  27. {simplicio_cli-0.4.0 → simplicio_cli-0.4.2}/simplicio/utils/http_client.py +0 -0
  28. {simplicio_cli-0.4.0 → simplicio_cli-0.4.2}/simplicio/utils/serialization.py +0 -0
  29. {simplicio_cli-0.4.0 → simplicio_cli-0.4.2}/simplicio_cli.egg-info/SOURCES.txt +0 -0
  30. {simplicio_cli-0.4.0 → simplicio_cli-0.4.2}/simplicio_cli.egg-info/dependency_links.txt +0 -0
  31. {simplicio_cli-0.4.0 → simplicio_cli-0.4.2}/simplicio_cli.egg-info/entry_points.txt +0 -0
  32. {simplicio_cli-0.4.0 → simplicio_cli-0.4.2}/simplicio_cli.egg-info/top_level.txt +0 -0
@@ -1,6 +1,6 @@
1
1
  Metadata-Version: 2.4
2
2
  Name: simplicio-cli
3
- Version: 0.4.0
3
+ Version: 0.4.2
4
4
  Summary: Portable task-to-code pipeline that works with any LLM. Turn a one-line task into a verified code change — diff + test + verify loop. +55 pts on a 156-check benchmark, 21% faster, ~same tokens.
5
5
  Author-email: Wesley Simplicio <wesleybob4@gmail.com>
6
6
  License: MIT
@@ -31,8 +31,8 @@ Requires-Dist: sentence-transformers>=2.2
31
31
  Requires-Dist: numpy>=1.23
32
32
  Requires-Dist: anthropic>=0.30
33
33
  Requires-Dist: openai>=1.30
34
- Requires-Dist: simplicio-mapper>=0.5.0
35
- Requires-Dist: simplicio-prompt>=1.7.0
34
+ Requires-Dist: simplicio-mapper>=0.6.0
35
+ Requires-Dist: simplicio-prompt>=1.9.0
36
36
  Requires-Dist: httpx>=0.27
37
37
  Requires-Dist: orjson>=3.10
38
38
  Requires-Dist: diskcache>=5.6
@@ -134,12 +134,25 @@ M1 MacBook (8 GB), five sub-4B tiny models, six frontier 2026 models, and three
134
134
  mid-tier 7B–12B open models. Every one gained at least **+14 points** when
135
135
  wrapped in simplicio's 6-layer contract.
136
136
 
137
- #### Hugging Face — Qwen2.5-Coder, re-run on 2026-05-27 (latest mapper, 10 cases/side, 156 checks)
137
+ #### Hugging Face — recommended Qwen3-Coder defaults (HF router)
138
138
 
139
- First batch of the smaller→larger re-benchmark against the latest
140
- `simplicio-mapper` artifacts. The 1.5B runs on CPU via `transformers`
141
- (Hugging Face Inference Providers does not serve it); the 3B and 7B run
142
- through the HF router (`https://router.huggingface.co/v1`).
139
+ The served Qwen Coder recommendation now uses the Qwen3-Coder MoE family.
140
+ `Qwen/Qwen2.5-Coder-3B-Instruct` and
141
+ `Qwen/Qwen2.5-Coder-7B-Instruct` remain available as legacy fallback models for
142
+ historical comparisons and hardware that cannot host the MoE successors.
143
+
144
+ | Slot | Recommended model | Route | Notes |
145
+ |---|---|---|---|
146
+ | Efficient coder | `Qwen/Qwen3-Coder-30B-A3B-Instruct` | HF router | 30B total / ~3B active MoE successor to the 3B slot |
147
+ | High-ceiling coder | `Qwen/Qwen3-Coder-Next` | HF router | 80B total / ~3B active MoE successor to the 7B slot |
148
+
149
+ > Reproduce the new default set:
150
+ > `BENCH_BASE_URL=https://router.huggingface.co/v1 BENCH_API_KEY=<hf-token>
151
+ > BENCH_MODELS="Qwen/Qwen3-Coder-30B-A3B-Instruct,Qwen/Qwen3-Coder-Next"
152
+ > python3 bench/run_offline.py`.
153
+
154
+ Legacy Qwen2.5-Coder baseline, re-run on 2026-05-27 against the latest
155
+ `simplicio-mapper` artifacts (10 cases/side, 156 checks):
143
156
 
144
157
  | Model | Without simplicio | With simplicio | Gain |
145
158
  |---|---|---|---|
@@ -148,10 +161,9 @@ through the HF router (`https://router.huggingface.co/v1`).
148
161
  | **Qwen 2.5 Coder 1.5B** (`Qwen/Qwen2.5-Coder-1.5B-Instruct`, local CPU) | 30% | **92%** | **+62 pts** |
149
162
  | **HF avg (3 models · 10 cases · 156 checks)** | **34%** | **94%** | **+60 pts (+172%)** |
150
163
 
151
- > Monotonic from smaller to larger: pass-rate with simplicio climbs **92% →
152
- > 94% → 96%** as the model grows, while the raw-prompt baseline stays at
153
- > **30–38%**. The 1.5B model gains the most (**+62 pts**) — the contract does
154
- > the heaviest lifting where the model is weakest. Reproduce:
164
+ > Monotonic from smaller to larger in the legacy baseline: pass-rate with
165
+ > simplicio climbs **92% → 94% → 96%** as the model grows, while the raw-prompt
166
+ > baseline stays at **30–38%**. Reproduce the legacy set:
155
167
  > `BENCH_BASE_URL=https://router.huggingface.co/v1 BENCH_API_KEY=<hf-token>
156
168
  > BENCH_MODELS="local:Qwen/Qwen2.5-Coder-1.5B-Instruct,Qwen/Qwen2.5-Coder-3B-Instruct,Qwen/Qwen2.5-Coder-7B-Instruct"
157
169
  > python3 bench/run_offline.py`.
@@ -167,7 +179,18 @@ Pro) show `n/a` for the new column: their OpenRouter calls hit account-level
167
179
  HTTP 402 / provider failures on >50% of requests this round, so the sample is
168
180
  too small to publish; their old numbers still stand.
169
181
 
170
- #### Local offline — qwen2.5-coder on Ollama, M1 8 GB, run on 2026-05-27 (30 runs/side, 156 checks)
182
+ #### Local offline — Qwen3-Coder GGUF recommendation, Qwen2.5 legacy baseline
183
+
184
+ For local OpenAI-compatible servers, prefer the Qwen3-Coder GGUF builds when
185
+ the machine can host MoE weights:
186
+
187
+ | Slot | Recommended local weights | Notes |
188
+ |---|---|---|
189
+ | Efficient coder | `unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF` | Primary local successor for the 3B-active slot |
190
+ | High-ceiling coder | `unsloth/Qwen3-Coder-Next-GGUF` | 24 GB GPU-class successor for long-context work |
191
+
192
+ The last fully offline fallback baseline remains qwen2.5-coder on Ollama,
193
+ M1 8 GB, run on 2026-05-27 (30 runs/side, 156 checks):
171
194
 
172
195
  | Model | Without simplicio | With simplicio | Gain |
173
196
  |---|---|---|---|
@@ -180,7 +203,7 @@ too small to publish; their old numbers still stand.
180
203
  > `http://localhost:11434/v1` (Ollama's OpenAI-compatible endpoint). A
181
204
  > 1.5B-param model running on a 4-year-old laptop reaches **88%** pass-rate
182
205
  > with simplicio's contract — same hardware, same model, raw prompt = 32%.
183
- > Reproduce: `BENCH_BASE_URL=http://localhost:11434/v1 BENCH_API_KEY=ollama
206
+ > Reproduce the legacy fallback: `BENCH_BASE_URL=http://localhost:11434/v1 BENCH_API_KEY=ollama
184
207
  > BENCH_MODELS="qwen2.5-coder:7b" python3 bench/run_offline.py`.
185
208
 
186
209
  #### Tiny models — sub-4B, run on 2026-05-26 (50 runs/side, 260 checks)
@@ -382,6 +405,29 @@ simplicio task "..." --stack angular --target ...
382
405
 
383
406
  How it works: simplicio shells out to `claude -p "<prompt>"` (or `codex exec "<prompt>"`) as a subprocess, captures stdout, runs the test loop. The inner CLI authenticates via your existing OAuth session in `~/.claude/` or `~/.codex/`. simplicio sets `SIMPLICIO_HOOK_GUARD=1` in the subprocess env so the inner Claude Code session does **not** re-fire simplicio's own UserPromptSubmit hook (no infinite recursion).
384
407
 
408
+ For orchestrators such as SendSprint, `simplicio task` also has a structured
409
+ contract:
410
+
411
+ ```bash
412
+ simplicio task "hide Delete button for non-admins" \
413
+ --stack angular \
414
+ --target src/app/screen/screen.component.html \
415
+ --dry-run-task \
416
+ --json
417
+
418
+ simplicio task "front-only task" \
419
+ --stack angular \
420
+ --target src/app/screen/screen.component.html \
421
+ --bound-paths "src/app/**" \
422
+ --json
423
+ ```
424
+
425
+ `--dry-run-task` generates the would-be diff/test output without applying or
426
+ testing it. `--json` returns `{task_id, applied, files_changed, tokens_used,
427
+ cost_usd, diff_summary, warnings}`. Repeat `--bound-paths <glob>` to reject
428
+ diffs outside the allowed edit surface; violations are reported in `warnings`
429
+ and the command exits non-zero.
430
+
385
431
  ### Path 3 example — standalone with API key
386
432
 
387
433
  ```bash
@@ -92,12 +92,25 @@ M1 MacBook (8 GB), five sub-4B tiny models, six frontier 2026 models, and three
92
92
  mid-tier 7B–12B open models. Every one gained at least **+14 points** when
93
93
  wrapped in simplicio's 6-layer contract.
94
94
 
95
- #### Hugging Face — Qwen2.5-Coder, re-run on 2026-05-27 (latest mapper, 10 cases/side, 156 checks)
95
+ #### Hugging Face — recommended Qwen3-Coder defaults (HF router)
96
96
 
97
- First batch of the smaller→larger re-benchmark against the latest
98
- `simplicio-mapper` artifacts. The 1.5B runs on CPU via `transformers`
99
- (Hugging Face Inference Providers does not serve it); the 3B and 7B run
100
- through the HF router (`https://router.huggingface.co/v1`).
97
+ The served Qwen Coder recommendation now uses the Qwen3-Coder MoE family.
98
+ `Qwen/Qwen2.5-Coder-3B-Instruct` and
99
+ `Qwen/Qwen2.5-Coder-7B-Instruct` remain available as legacy fallback models for
100
+ historical comparisons and hardware that cannot host the MoE successors.
101
+
102
+ | Slot | Recommended model | Route | Notes |
103
+ |---|---|---|---|
104
+ | Efficient coder | `Qwen/Qwen3-Coder-30B-A3B-Instruct` | HF router | 30B total / ~3B active MoE successor to the 3B slot |
105
+ | High-ceiling coder | `Qwen/Qwen3-Coder-Next` | HF router | 80B total / ~3B active MoE successor to the 7B slot |
106
+
107
+ > Reproduce the new default set:
108
+ > `BENCH_BASE_URL=https://router.huggingface.co/v1 BENCH_API_KEY=<hf-token>
109
+ > BENCH_MODELS="Qwen/Qwen3-Coder-30B-A3B-Instruct,Qwen/Qwen3-Coder-Next"
110
+ > python3 bench/run_offline.py`.
111
+
112
+ Legacy Qwen2.5-Coder baseline, re-run on 2026-05-27 against the latest
113
+ `simplicio-mapper` artifacts (10 cases/side, 156 checks):
101
114
 
102
115
  | Model | Without simplicio | With simplicio | Gain |
103
116
  |---|---|---|---|
@@ -106,10 +119,9 @@ through the HF router (`https://router.huggingface.co/v1`).
106
119
  | **Qwen 2.5 Coder 1.5B** (`Qwen/Qwen2.5-Coder-1.5B-Instruct`, local CPU) | 30% | **92%** | **+62 pts** |
107
120
  | **HF avg (3 models · 10 cases · 156 checks)** | **34%** | **94%** | **+60 pts (+172%)** |
108
121
 
109
- > Monotonic from smaller to larger: pass-rate with simplicio climbs **92% →
110
- > 94% → 96%** as the model grows, while the raw-prompt baseline stays at
111
- > **30–38%**. The 1.5B model gains the most (**+62 pts**) — the contract does
112
- > the heaviest lifting where the model is weakest. Reproduce:
122
+ > Monotonic from smaller to larger in the legacy baseline: pass-rate with
123
+ > simplicio climbs **92% → 94% → 96%** as the model grows, while the raw-prompt
124
+ > baseline stays at **30–38%**. Reproduce the legacy set:
113
125
  > `BENCH_BASE_URL=https://router.huggingface.co/v1 BENCH_API_KEY=<hf-token>
114
126
  > BENCH_MODELS="local:Qwen/Qwen2.5-Coder-1.5B-Instruct,Qwen/Qwen2.5-Coder-3B-Instruct,Qwen/Qwen2.5-Coder-7B-Instruct"
115
127
  > python3 bench/run_offline.py`.
@@ -125,7 +137,18 @@ Pro) show `n/a` for the new column: their OpenRouter calls hit account-level
125
137
  HTTP 402 / provider failures on >50% of requests this round, so the sample is
126
138
  too small to publish; their old numbers still stand.
127
139
 
128
- #### Local offline — qwen2.5-coder on Ollama, M1 8 GB, run on 2026-05-27 (30 runs/side, 156 checks)
140
+ #### Local offline — Qwen3-Coder GGUF recommendation, Qwen2.5 legacy baseline
141
+
142
+ For local OpenAI-compatible servers, prefer the Qwen3-Coder GGUF builds when
143
+ the machine can host MoE weights:
144
+
145
+ | Slot | Recommended local weights | Notes |
146
+ |---|---|---|
147
+ | Efficient coder | `unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF` | Primary local successor for the 3B-active slot |
148
+ | High-ceiling coder | `unsloth/Qwen3-Coder-Next-GGUF` | 24 GB GPU-class successor for long-context work |
149
+
150
+ The last fully offline fallback baseline remains qwen2.5-coder on Ollama,
151
+ M1 8 GB, run on 2026-05-27 (30 runs/side, 156 checks):
129
152
 
130
153
  | Model | Without simplicio | With simplicio | Gain |
131
154
  |---|---|---|---|
@@ -138,7 +161,7 @@ too small to publish; their old numbers still stand.
138
161
  > `http://localhost:11434/v1` (Ollama's OpenAI-compatible endpoint). A
139
162
  > 1.5B-param model running on a 4-year-old laptop reaches **88%** pass-rate
140
163
  > with simplicio's contract — same hardware, same model, raw prompt = 32%.
141
- > Reproduce: `BENCH_BASE_URL=http://localhost:11434/v1 BENCH_API_KEY=ollama
164
+ > Reproduce the legacy fallback: `BENCH_BASE_URL=http://localhost:11434/v1 BENCH_API_KEY=ollama
142
165
  > BENCH_MODELS="qwen2.5-coder:7b" python3 bench/run_offline.py`.
143
166
 
144
167
  #### Tiny models — sub-4B, run on 2026-05-26 (50 runs/side, 260 checks)
@@ -340,6 +363,29 @@ simplicio task "..." --stack angular --target ...
340
363
 
341
364
  How it works: simplicio shells out to `claude -p "<prompt>"` (or `codex exec "<prompt>"`) as a subprocess, captures stdout, runs the test loop. The inner CLI authenticates via your existing OAuth session in `~/.claude/` or `~/.codex/`. simplicio sets `SIMPLICIO_HOOK_GUARD=1` in the subprocess env so the inner Claude Code session does **not** re-fire simplicio's own UserPromptSubmit hook (no infinite recursion).
342
365
 
366
+ For orchestrators such as SendSprint, `simplicio task` also has a structured
367
+ contract:
368
+
369
+ ```bash
370
+ simplicio task "hide Delete button for non-admins" \
371
+ --stack angular \
372
+ --target src/app/screen/screen.component.html \
373
+ --dry-run-task \
374
+ --json
375
+
376
+ simplicio task "front-only task" \
377
+ --stack angular \
378
+ --target src/app/screen/screen.component.html \
379
+ --bound-paths "src/app/**" \
380
+ --json
381
+ ```
382
+
383
+ `--dry-run-task` generates the would-be diff/test output without applying or
384
+ testing it. `--json` returns `{task_id, applied, files_changed, tokens_used,
385
+ cost_usd, diff_summary, warnings}`. Repeat `--bound-paths <glob>` to reject
386
+ diffs outside the allowed edit surface; violations are reported in `warnings`
387
+ and the command exits non-zero.
388
+
343
389
  ### Path 3 example — standalone with API key
344
390
 
345
391
  ```bash
@@ -1,6 +1,6 @@
1
1
  [project]
2
2
  name = "simplicio-cli"
3
- version = "0.4.0"
3
+ version = "0.4.2"
4
4
  description = "Portable task-to-code pipeline that works with any LLM. Turn a one-line task into a verified code change — diff + test + verify loop. +55 pts on a 156-check benchmark, 21% faster, ~same tokens."
5
5
  readme = "README.md"
6
6
  license = { text = "MIT" }
@@ -45,8 +45,8 @@ dependencies = [
45
45
  "numpy>=1.23",
46
46
  "anthropic>=0.30",
47
47
  "openai>=1.30",
48
- "simplicio-mapper>=0.5.0",
49
- "simplicio-prompt>=1.7.0",
48
+ "simplicio-mapper>=0.6.0",
49
+ "simplicio-prompt>=1.9.0",
50
50
  "httpx>=0.27",
51
51
  "orjson>=3.10",
52
52
  "diskcache>=5.6",
@@ -12,6 +12,7 @@ first CLI use instead — the closest equivalent that works on every machine.
12
12
  from __future__ import annotations
13
13
 
14
14
  import argparse
15
+ import json
15
16
  import os
16
17
  import sys
17
18
  from pathlib import Path
@@ -27,7 +28,8 @@ def maybe_autoinstall(cmd: str | None) -> bool:
27
28
  return False
28
29
  if cmd in ("init", "detect"):
29
30
  return False
30
- claude_home = Path.home() / ".claude"
31
+ home = Path(os.environ["HOME"]) if os.environ.get("HOME") else Path.home()
32
+ claude_home = home / ".claude"
31
33
  if not claude_home.is_dir():
32
34
  return False
33
35
  hook_path = claude_home / "hooks" / "simplicio-userpromptsubmit.sh"
@@ -50,7 +52,7 @@ def maybe_autoinstall(cmd: str | None) -> bool:
50
52
  return False
51
53
 
52
54
 
53
- def main():
55
+ def main(argv=None):
54
56
  ap = argparse.ArgumentParser(prog="simplicio")
55
57
  sub = ap.add_subparsers(dest="cmd", required=True)
56
58
 
@@ -63,6 +65,12 @@ def main():
63
65
  pt.add_argument("--target", required=True)
64
66
  pt.add_argument("--criteria", default="- true state\n- false state")
65
67
  pt.add_argument("--constraints", default="- build passes")
68
+ pt.add_argument("--dry-run-task", action="store_true",
69
+ help="generate the would-be task output without applying/testing")
70
+ pt.add_argument("--json", action="store_true",
71
+ help="emit stable structured task output")
72
+ pt.add_argument("--bound-paths", action="append", default=[],
73
+ help="glob limiting which paths the task may change; repeatable")
66
74
 
67
75
 
68
76
  pb = sub.add_parser("bench", help="compare with vs without (real numbers)")
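Note on the hunk above: `--bound-paths` is declared with `action="append"` and `default=[]`, so each repetition of the flag adds one glob to a list. The following is a minimal standalone sketch of that behaviour. The `demo` parser and its cut-down `task` subcommand are illustrative assumptions and mirror only the three flags added here, not the package's full CLI.

```python
import argparse


def build_demo_parser():
    """Throwaway parser (hypothetical) mirroring only the flags added in this hunk."""
    ap = argparse.ArgumentParser(prog="demo")
    sub = ap.add_subparsers(dest="cmd", required=True)
    pt = sub.add_parser("task")
    pt.add_argument("goal")
    pt.add_argument("--dry-run-task", action="store_true")
    pt.add_argument("--json", action="store_true")
    # action="append" collects every occurrence of the flag into one list.
    pt.add_argument("--bound-paths", action="append", default=[])
    return ap


# Passing an explicit argv list is the same pattern the new main(argv=None) enables.
args = build_demo_parser().parse_args([
    "task", "front-only task",
    "--bound-paths", "src/app/**",
    "--bound-paths", "docs/**",
    "--json",
])
print(args.bound_paths, args.json, args.dry_run_task)
```

Because `parse_args` accepts an explicit list, a parser built this way can be exercised from tests without touching `sys.argv`.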
@@ -81,7 +89,7 @@ def main():
81
89
  p_det.add_argument("--quiet", action="store_true")
82
90
  p_det.add_argument("--json", action="store_true")
83
91
 
84
- a = ap.parse_args()
92
+ a = ap.parse_args(argv)
85
93
  maybe_autoinstall(a.cmd)
86
94
  if a.cmd == "index":
87
95
  from .precedent import index_repo
@@ -113,8 +121,30 @@ def main():
113
121
  argv += ["--json"]
114
122
  return detect_main(argv)
115
123
  else:
116
- from .pipeline import run
117
- run(a.root, a.stack, a.goal, a.target, a.criteria, a.constraints)
124
+ from .pipeline import run, run_task
125
+ if a.json or a.dry_run_task:
126
+ result = run_task(
127
+ a.root,
128
+ a.stack,
129
+ a.goal,
130
+ a.target,
131
+ a.criteria,
132
+ a.constraints,
133
+ dry_run_task=a.dry_run_task,
134
+ bound_paths=a.bound_paths,
135
+ quiet=a.json,
136
+ )
137
+ if a.json:
138
+ print(json.dumps(result, sort_keys=True))
139
+ else:
140
+ status = "DRY-RUN" if a.dry_run_task else "DONE"
141
+ print(f"{status}: {result['diff_summary']}")
142
+ for warning in result["warnings"]:
143
+ print(f"warning: {warning}", file=sys.stderr)
144
+ return 0 if (a.dry_run_task or result["applied"]) else 1
145
+ run(a.root, a.stack, a.goal, a.target, a.criteria, a.constraints,
146
+ bound_paths=a.bound_paths)
147
+ return 0
118
148
 
119
149
  if __name__ == "__main__":
120
150
  main()
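Note on the `--json` path above: the result is printed as one JSON object with sorted keys. The following is a hedged sketch of how an orchestrator might consume it. `parse_task_result`, `should_accept`, and the sample payload are hypothetical; only the key names come from the README text in this diff, and the nested `tokens_used` shape follows `_task_result` in `pipeline.py`.

```python
import json

# Key names listed in the README for `--json` output.
EXPECTED_KEYS = {
    "task_id", "applied", "files_changed", "tokens_used",
    "cost_usd", "diff_summary", "warnings",
}


def parse_task_result(stdout_text):
    """Parse the single JSON object printed on stdout and verify its keys."""
    result = json.loads(stdout_text)
    missing = EXPECTED_KEYS - set(result)
    if missing:
        raise ValueError(f"missing keys in task result: {sorted(missing)}")
    return result


def should_accept(result):
    """Hypothetical orchestrator policy: accept only applied, warning-free runs."""
    return bool(result["applied"]) and not result["warnings"]


# Made-up payload shaped like a dry run that produced one bound-path warning.
sample_stdout = json.dumps({
    "task_id": "src/app/screen/screen.component.html",
    "applied": False,
    "files_changed": ["src/app/screen/screen.component.html", "server/api.py"],
    "tokens_used": {"prompt": 1200, "completion": 340},
    "cost_usd": 0.0,
    "diff_summary": "changed src/app/screen/screen.component.html, server/api.py",
    "warnings": [
        "diff touches path outside bound paths: server/api.py (allowed: src/app/**)"
    ],
}, sort_keys=True)

parsed = parse_task_result(sample_stdout)
print(should_accept(parsed), parsed["warnings"])
```

Checking `warnings` in addition to `applied` is a design choice of this sketch: in the code above, a dry run returns `applied: false` by construction, so the warnings list is the only signal of a bound-path violation in that mode.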
@@ -1,5 +1,6 @@
1
1
  """pipeline.py — build -> generate -> validate -> test -> fix (loop)."""
2
2
  from dataclasses import dataclass
3
+ import fnmatch
3
4
  import os, re, subprocess
4
5
  from .observability import estimate_tokens, log_run
5
6
  from .prompt import build_prompt
@@ -18,7 +19,40 @@ class FailureClassification:
18
19
  kind: str
19
20
  guidance: str
20
21
 
21
- def validate_generated_output(output):
22
+ def extract_changed_files(output):
23
+ text = output or ""
24
+ files = []
25
+ for match in re.finditer(r"^diff --git a/(.+?) b/(.+?)$", text, flags=re.M):
26
+ files.append(match.group(2).strip())
27
+ for match in re.finditer(r"^\+\+\+ b/(.+?)$", text, flags=re.M):
28
+ files.append(match.group(1).strip())
29
+ return list(dict.fromkeys(f for f in files if f and f != "/dev/null"))
30
+
31
+ def _matches_bound(path, patterns):
32
+ normalized = path.replace(os.sep, "/").lstrip("./")
33
+ for raw in patterns or []:
34
+ pattern = str(raw).replace(os.sep, "/").lstrip("./")
35
+ if fnmatch.fnmatch(normalized, pattern):
36
+ return True
37
+ if pattern.endswith("/**"):
38
+ prefix = pattern[:-3].rstrip("/")
39
+ if normalized == prefix or normalized.startswith(f"{prefix}/"):
40
+ return True
41
+ return False
42
+
43
+ def _bound_path_warnings(files, bound_paths):
44
+ if not bound_paths:
45
+ return []
46
+ outside = [path for path in files if not _matches_bound(path, bound_paths)]
47
+ if not outside:
48
+ return []
49
+ return [
50
+ "diff touches path outside bound paths: "
51
+ + ", ".join(outside)
52
+ + f" (allowed: {', '.join(bound_paths)})"
53
+ ]
54
+
55
+ def validate_generated_output(output, bound_paths=None):
22
56
  text = output or ""
23
57
  hints = []
24
58
  has_diff = bool(re.search(r"^diff --git |^--- .+\n\+\+\+ ", text, flags=re.M))
@@ -29,6 +63,7 @@ def validate_generated_output(output):
29
63
  hints.append("include a TEST block or concrete test code")
30
64
  if re.search(r"(?i)\b(pseudocode|placeholder|todo: implement)\b", text):
31
65
  hints.append("replace placeholders with executable code")
66
+ hints.extend(_bound_path_warnings(extract_changed_files(output), bound_paths))
32
67
  return ValidationResult(
33
68
  ok=not hints,
34
69
  reason="ok" if not hints else "; ".join(hints),
@@ -64,10 +99,10 @@ def build_retry_feedback(attempt, validation=None, test_log=""):
64
99
  lines.append("Return the full corrected DIFF + TEST block only.")
65
100
  return "\n".join(lines)
66
101
 
67
- def _apply_and_test(output, root):
102
+ def _apply_and_test(output, root, bound_paths=None):
68
103
  os.makedirs(os.path.join(root, ".simplicio"), exist_ok=True)
69
104
  open(os.path.join(root, ".simplicio/last_output.txt"), "w").write(output or "")
70
- validation = validate_generated_output(output)
105
+ validation = validate_generated_output(output, bound_paths)
71
106
  if not validation.ok:
72
107
  return False, f"pre-apply validation failed: {validation.reason}"
73
108
  # PLUG: extract diff -> git apply; extract test. Here we run the test command.
@@ -75,13 +110,47 @@ def _apply_and_test(output, root):
75
110
  p = subprocess.run(cmd, shell=True, cwd=root, capture_output=True, text=True)
76
111
  return p.returncode == 0, (p.stdout + p.stderr)[-2000:]
77
112
 
78
- def run(root, stack, goal, target, criteria, constraints):
113
+ def _diff_summary(files_changed):
114
+ if not files_changed:
115
+ return "no changed files reported"
116
+ return "changed " + ", ".join(files_changed)
117
+
118
+ def _task_result(task_id, prompt, output, *, applied, warnings=None):
119
+ files_changed = extract_changed_files(output)
120
+ return {
121
+ "task_id": task_id,
122
+ "applied": bool(applied),
123
+ "files_changed": files_changed,
124
+ "tokens_used": {
125
+ "prompt": estimate_tokens(prompt),
126
+ "completion": estimate_tokens(output or ""),
127
+ },
128
+ "cost_usd": 0.0,
129
+ "diff_summary": _diff_summary(files_changed),
130
+ "warnings": warnings or [],
131
+ }
132
+
133
+ def run_task(root, stack, goal, target, criteria, constraints, *,
134
+ dry_run_task=False, bound_paths=None, quiet=False):
79
135
  prompt = build_prompt(root, stack, goal, target, criteria, constraints)
136
+ if dry_run_task:
137
+ output = generate(prompt)
138
+ validation = validate_generated_output(output, bound_paths)
139
+ warnings = [] if validation.ok else [validation.reason]
140
+ return _task_result(target, prompt, output, applied=False, warnings=warnings)
141
+
80
142
  feedback = None
143
+ last_output = ""
144
+ last_validation = None
145
+ last_log = ""
81
146
  for t in range(1, MAX_ATTEMPTS + 1):
82
- print(f"--- attempt {t} (provider={os.environ.get('SIMPLICIO_PROVIDER','claude')}) ---")
147
+ if not quiet:
148
+ print(f"--- attempt {t} (provider={os.environ.get('SIMPLICIO_PROVIDER','claude')}) ---")
83
149
  output = generate(prompt, feedback)
84
- ok, log = _apply_and_test(output, root)
150
+ last_output = output or ""
151
+ last_validation = validate_generated_output(output, bound_paths)
152
+ ok, log = _apply_and_test(output, root, bound_paths)
153
+ last_log = log
85
154
  log_run(root, {
86
155
  "mode": "pipeline",
87
156
  "attempt": t,
@@ -92,9 +161,24 @@ def run(root, stack, goal, target, criteria, constraints):
92
161
  "stack": stack,
93
162
  })
94
163
  if ok:
95
- print("PASSED the contract. DONE.")
96
- return output
97
- print("failed:", log[:300])
98
- feedback = build_retry_feedback(t + 1, validate_generated_output(output), log)
99
- print("attempts exhausted — manual review needed.")
164
+ if not quiet:
165
+ print("PASSED the contract. DONE.")
166
+ return _task_result(target, prompt, output, applied=True)
167
+ if not quiet:
168
+ print("failed:", log[:300])
169
+ feedback = build_retry_feedback(t + 1, last_validation, log)
170
+ if not quiet:
171
+ print("attempts exhausted — manual review needed.")
172
+ warnings = []
173
+ if last_validation and not last_validation.ok:
174
+ warnings.append(last_validation.reason)
175
+ elif last_log:
176
+ warnings.append(last_log[:500])
177
+ return _task_result(target, prompt, last_output, applied=False, warnings=warnings)
178
+
179
+ def run(root, stack, goal, target, criteria, constraints, bound_paths=None):
180
+ result = run_task(root, stack, goal, target, criteria, constraints,
181
+ bound_paths=bound_paths)
182
+ if result["applied"]:
183
+ return result
100
184
  return None
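Note on the bound-path check in this file: changed paths are read from `diff --git` and `+++ b/` headers and then matched against the allowed globs. The sketch below copies `extract_changed_files` from the diff and pairs it with a matcher that is a deliberate variation, not the package's code. `str.lstrip("./")` strips any leading run of `.` and `/` characters, so `.github/ci.yml` normalizes to `github/ci.yml`; the hypothetical `strip_dot_slash` helper here removes only literal `./` prefixes instead. The sample diff is invented for illustration.

```python
import fnmatch
import re


def extract_changed_files(output):
    """Collect target paths from diff headers, de-duplicated in first-seen order."""
    text = output or ""
    files = []
    for match in re.finditer(r"^diff --git a/(.+?) b/(.+?)$", text, flags=re.M):
        files.append(match.group(2).strip())
    for match in re.finditer(r"^\+\+\+ b/(.+?)$", text, flags=re.M):
        files.append(match.group(1).strip())
    return list(dict.fromkeys(f for f in files if f and f != "/dev/null"))


def strip_dot_slash(path):
    """Variation (assumption): drop only literal leading './' segments."""
    path = path.replace("\\", "/")
    while path.startswith("./"):
        path = path[2:]
    return path


def matches_bound(path, patterns):
    """Return True if path matches any glob, treating 'dir/**' as a directory prefix."""
    normalized = strip_dot_slash(path)
    for raw in patterns or []:
        pattern = strip_dot_slash(str(raw))
        if fnmatch.fnmatchcase(normalized, pattern):
            return True
        if pattern.endswith("/**"):
            prefix = pattern[:-3].rstrip("/")
            if normalized == prefix or normalized.startswith(prefix + "/"):
                return True
    return False


# Invented two-file diff: one path inside the bound, one outside.
SAMPLE_DIFF = """\
diff --git a/src/app/screen/screen.component.html b/src/app/screen/screen.component.html
--- a/src/app/screen/screen.component.html
+++ b/src/app/screen/screen.component.html
@@ -1 +1 @@
-<button>Delete</button>
+<button *ngIf="isAdmin">Delete</button>
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
--- a/.github/workflows/ci.yml
+++ b/.github/workflows/ci.yml
@@ -1 +1 @@
-name: ci
+name: build
"""

changed = extract_changed_files(SAMPLE_DIFF)
outside = [p for p in changed if not matches_bound(p, ["src/app/**"])]
print(changed)
print(outside)
```

The sketch uses `fnmatch.fnmatchcase` rather than `fnmatch.fnmatch` so that matching stays case-sensitive on every platform; that is also a choice of this example rather than something the diff does.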
@@ -1,6 +1,6 @@
1
1
  Metadata-Version: 2.4
2
2
  Name: simplicio-cli
3
- Version: 0.4.0
3
+ Version: 0.4.2
4
4
  Summary: Portable task-to-code pipeline that works with any LLM. Turn a one-line task into a verified code change — diff + test + verify loop. +55 pts on a 156-check benchmark, 21% faster, ~same tokens.
5
5
  Author-email: Wesley Simplicio <wesleybob4@gmail.com>
6
6
  License: MIT
@@ -31,8 +31,8 @@ Requires-Dist: sentence-transformers>=2.2
31
31
  Requires-Dist: numpy>=1.23
32
32
  Requires-Dist: anthropic>=0.30
33
33
  Requires-Dist: openai>=1.30
34
- Requires-Dist: simplicio-mapper>=0.5.0
35
- Requires-Dist: simplicio-prompt>=1.7.0
34
+ Requires-Dist: simplicio-mapper>=0.6.0
35
+ Requires-Dist: simplicio-prompt>=1.9.0
36
36
  Requires-Dist: httpx>=0.27
37
37
  Requires-Dist: orjson>=3.10
38
38
  Requires-Dist: diskcache>=5.6
@@ -134,12 +134,25 @@ M1 MacBook (8 GB), five sub-4B tiny models, six frontier 2026 models, and three
134
134
  mid-tier 7B–12B open models. Every one gained at least **+14 points** when
135
135
  wrapped in simplicio's 6-layer contract.
136
136
 
137
- #### Hugging Face — Qwen2.5-Coder, re-run on 2026-05-27 (latest mapper, 10 cases/side, 156 checks)
137
+ #### Hugging Face — recommended Qwen3-Coder defaults (HF router)
138
138
 
139
- First batch of the smaller→larger re-benchmark against the latest
140
- `simplicio-mapper` artifacts. The 1.5B runs on CPU via `transformers`
141
- (Hugging Face Inference Providers does not serve it); the 3B and 7B run
142
- through the HF router (`https://router.huggingface.co/v1`).
139
+ The served Qwen Coder recommendation now uses the Qwen3-Coder MoE family.
140
+ `Qwen/Qwen2.5-Coder-3B-Instruct` and
141
+ `Qwen/Qwen2.5-Coder-7B-Instruct` remain available as legacy fallback models for
142
+ historical comparisons and hardware that cannot host the MoE successors.
143
+
144
+ | Slot | Recommended model | Route | Notes |
145
+ |---|---|---|---|
146
+ | Efficient coder | `Qwen/Qwen3-Coder-30B-A3B-Instruct` | HF router | 30B total / ~3B active MoE successor to the 3B slot |
147
+ | High-ceiling coder | `Qwen/Qwen3-Coder-Next` | HF router | 80B total / ~3B active MoE successor to the 7B slot |
148
+
149
+ > Reproduce the new default set:
150
+ > `BENCH_BASE_URL=https://router.huggingface.co/v1 BENCH_API_KEY=<hf-token>
151
+ > BENCH_MODELS="Qwen/Qwen3-Coder-30B-A3B-Instruct,Qwen/Qwen3-Coder-Next"
152
+ > python3 bench/run_offline.py`.
153
+
154
+ Legacy Qwen2.5-Coder baseline, re-run on 2026-05-27 against the latest
155
+ `simplicio-mapper` artifacts (10 cases/side, 156 checks):
143
156
 
144
157
  | Model | Without simplicio | With simplicio | Gain |
145
158
  |---|---|---|---|
@@ -148,10 +161,9 @@ through the HF router (`https://router.huggingface.co/v1`).
148
161
  | **Qwen 2.5 Coder 1.5B** (`Qwen/Qwen2.5-Coder-1.5B-Instruct`, local CPU) | 30% | **92%** | **+62 pts** |
149
162
  | **HF avg (3 models · 10 cases · 156 checks)** | **34%** | **94%** | **+60 pts (+172%)** |
150
163
 
151
- > Monotonic from smaller to larger: pass-rate with simplicio climbs **92% →
152
- > 94% → 96%** as the model grows, while the raw-prompt baseline stays at
153
- > **30–38%**. The 1.5B model gains the most (**+62 pts**) — the contract does
154
- > the heaviest lifting where the model is weakest. Reproduce:
164
+ > Monotonic from smaller to larger in the legacy baseline: pass-rate with
165
+ > simplicio climbs **92% → 94% → 96%** as the model grows, while the raw-prompt
166
+ > baseline stays at **30–38%**. Reproduce the legacy set:
155
167
  > `BENCH_BASE_URL=https://router.huggingface.co/v1 BENCH_API_KEY=<hf-token>
156
168
  > BENCH_MODELS="local:Qwen/Qwen2.5-Coder-1.5B-Instruct,Qwen/Qwen2.5-Coder-3B-Instruct,Qwen/Qwen2.5-Coder-7B-Instruct"
157
169
  > python3 bench/run_offline.py`.
@@ -167,7 +179,18 @@ Pro) show `n/a` for the new column: their OpenRouter calls hit account-level
167
179
  HTTP 402 / provider failures on >50% of requests this round, so the sample is
168
180
  too small to publish; their old numbers still stand.
169
181
 
170
- #### Local offline — qwen2.5-coder on Ollama, M1 8 GB, run on 2026-05-27 (30 runs/side, 156 checks)
182
+ #### Local offline — Qwen3-Coder GGUF recommendation, Qwen2.5 legacy baseline
183
+
184
+ For local OpenAI-compatible servers, prefer the Qwen3-Coder GGUF builds when
185
+ the machine can host MoE weights:
186
+
187
+ | Slot | Recommended local weights | Notes |
188
+ |---|---|---|
189
+ | Efficient coder | `unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF` | Primary local successor for the 3B-active slot |
190
+ | High-ceiling coder | `unsloth/Qwen3-Coder-Next-GGUF` | 24 GB GPU-class successor for long-context work |
191
+
192
+ The last fully offline fallback baseline remains qwen2.5-coder on Ollama,
193
+ M1 8 GB, run on 2026-05-27 (30 runs/side, 156 checks):
171
194
 
172
195
  | Model | Without simplicio | With simplicio | Gain |
173
196
  |---|---|---|---|
@@ -180,7 +203,7 @@ too small to publish; their old numbers still stand.
180
203
  > `http://localhost:11434/v1` (Ollama's OpenAI-compatible endpoint). A
181
204
  > 1.5B-param model running on a 4-year-old laptop reaches **88%** pass-rate
182
205
  > with simplicio's contract — same hardware, same model, raw prompt = 32%.
183
- > Reproduce: `BENCH_BASE_URL=http://localhost:11434/v1 BENCH_API_KEY=ollama
206
+ > Reproduce the legacy fallback: `BENCH_BASE_URL=http://localhost:11434/v1 BENCH_API_KEY=ollama
184
207
  > BENCH_MODELS="qwen2.5-coder:7b" python3 bench/run_offline.py`.
185
208
 
186
209
  #### Tiny models — sub-4B, run on 2026-05-26 (50 runs/side, 260 checks)
@@ -382,6 +405,29 @@ simplicio task "..." --stack angular --target ...
382
405
 
383
406
  How it works: simplicio shells out to `claude -p "<prompt>"` (or `codex exec "<prompt>"`) as a subprocess, captures stdout, runs the test loop. The inner CLI authenticates via your existing OAuth session in `~/.claude/` or `~/.codex/`. simplicio sets `SIMPLICIO_HOOK_GUARD=1` in the subprocess env so the inner Claude Code session does **not** re-fire simplicio's own UserPromptSubmit hook (no infinite recursion).
384
407
 
408
+ For orchestrators such as SendSprint, `simplicio task` also has a structured
409
+ contract:
410
+
411
+ ```bash
412
+ simplicio task "hide Delete button for non-admins" \
413
+ --stack angular \
414
+ --target src/app/screen/screen.component.html \
415
+ --dry-run-task \
416
+ --json
417
+
418
+ simplicio task "front-only task" \
419
+ --stack angular \
420
+ --target src/app/screen/screen.component.html \
421
+ --bound-paths "src/app/**" \
422
+ --json
423
+ ```
424
+
425
+ `--dry-run-task` generates the would-be diff/test output without applying or
426
+ testing it. `--json` returns `{task_id, applied, files_changed, tokens_used,
427
+ cost_usd, diff_summary, warnings}`. Repeat `--bound-paths <glob>` to reject
428
+ diffs outside the allowed edit surface; violations are reported in `warnings`
429
+ and the command exits non-zero.
430
+
385
431
  ### Path 3 example — standalone with API key
386
432
 
387
433
  ```bash
@@ -2,8 +2,8 @@ sentence-transformers>=2.2
2
2
  numpy>=1.23
3
3
  anthropic>=0.30
4
4
  openai>=1.30
5
- simplicio-mapper>=0.5.0
6
- simplicio-prompt>=1.7.0
5
+ simplicio-mapper>=0.6.0
6
+ simplicio-prompt>=1.9.0
7
7
  httpx>=0.27
8
8
  orjson>=3.10
9
9
  diskcache>=5.6
File without changes