noepicycle 0.1.1__tar.gz

@@ -0,0 +1,29 @@
1
+ name: Publish to PyPI
2
+
3
+ on:
4
+   push:
5
+     branches:
6
+       - main
7
+
8
+ jobs:
9
+   publish:
10
+     runs-on: ubuntu-latest
11
+     steps:
12
+       - uses: actions/checkout@v4
13
+
14
+       - name: Set up Python
15
+         uses: actions/setup-python@v5
16
+         with:
17
+           python-version: "3.11"
18
+
19
+       - name: Install build tools
20
+         run: pip install build twine
21
+
22
+       - name: Build package
23
+         run: python -m build
24
+
25
+       - name: Publish to PyPI
26
+         env:
27
+           TWINE_USERNAME: __token__
28
+           TWINE_PASSWORD: ${{ secrets.PYPI_TOKEN }}
29
+         run: twine upload dist/* --skip-existing
@@ -0,0 +1,20 @@
1
+ .env
2
+
3
+ __pycache__/
4
+ *.pyc
5
+ *.pyo
6
+ .ruff_cache/
7
+ *.egg-info/
8
+ dist/
9
+ build/
10
+
11
+ venv/
12
+ .venv/
13
+
14
+ .vscode/
15
+
16
+ eval/results/
17
+ *.log
18
+
19
+ .DS_Store
20
+ Thumbs.db
@@ -0,0 +1,359 @@
1
+ Metadata-Version: 2.4
2
+ Name: noepicycle
3
+ Version: 0.1.1
4
+ Summary: A LangGraph supervisor that prevents agentic coding loops from going in circles and burning tokens
5
+ License: MIT
6
+ Requires-Python: >=3.11
7
+ Requires-Dist: anthropic>=0.25.0
8
+ Requires-Dist: langchain-anthropic>=0.1.0
9
+ Requires-Dist: langgraph>=0.2.0
10
+ Requires-Dist: python-dotenv>=1.0.0
11
+ Requires-Dist: rich>=13.0.0
12
+ Requires-Dist: typer>=0.12.0
13
+ Provides-Extra: dev
14
+ Requires-Dist: pytest-asyncio; extra == 'dev'
15
+ Requires-Dist: pytest>=8.0.0; extra == 'dev'
16
+ Requires-Dist: ruff; extra == 'dev'
17
+ Description-Content-Type: text/markdown
18
+
19
+ # noepicycle
20
+
21
+ As [Michael Shimeles](https://www.youtube.com/watch?v=7clJ8IH784Q) has pointed out, unless you work at Anthropic and have unlimited token access, agentic coding loops are prohibitively expensive. noepicycle is a LangGraph-native supervisor agent that prevents token cost from spiralling by detecting diminishing
22
+ returns in real time and dynamically switching models or stopping cleanly
23
+ when it's clear you've started burning through the token budget without improvements.
24
+
25
+ ## The problem
26
+
27
+ Most agentic coding loops run until they hit a fixed iteration cap or
28
+ exhaust a token budget. But research shows that **rounds 1-2 capture
29
+ 75% of reachable improvement** in LLM refinement loops [Williams, 2026].
30
+ Much of what comes after that is diminishing returns, or worse: continuing
31
+ to loop can actively harm code quality:
32
+
33
+ - **Self-conditioning**: models condition on their own prior failures,
34
+ making future mistakes *more* likely, not less [arXiv:2509.09677]
35
+ - **Context rot**: each iteration appends more context, degrading
36
+ performance by 15-20 percentage points by the time you've accumulated
37
+ ~4,000 tokens of history [Stanford/Redis, 2025]
38
+ - **Oscillation**: loops are less likely to plateau cleanly than to
39
+ oscillate, making a single-iteration dip look like recovery when
40
+ it isn't [arXiv:2509.04191]
41
+
42
+ ## How it works
43
+
44
+ noepicycle runs as an **outer LangGraph supervisor** over your existing
45
+ inner coding loop (there's also a bare-bones inner loop provided).
46
+ After each iteration, it:
47
+
48
+ 1. Scores the current solution against your scoring function
49
+ 2. Computes the improvement delta vs. prior iterations
50
+ 3. Checks the delta trend against research-backed thresholds
51
+ 4. Makes one of three decisions:
52
+ - **Continue** — improvement is real, keep going
53
+ - **Switch** — plateau or regression detected; switch to a different
54
+ model with a clean context summary, breaking the self-conditioning
55
+ cycle
56
+ - **Stop** — solved, budget exhausted, or all ladder rungs tried
57
+
58
+ Model switching follows a **user-configurable ladder** — a ranked
59
+ graph of models with explicit transition rules for plateau, regression,
60
+ and budget-low events. The default ladder alternates between model
61
+ families to break reasoning ruts.
62
+
63
+ ```
64
+ [Your inner loop agent]
65
+ ↑ ↓ (one iteration)
66
+ [noepicycle supervisor]
67
+ → score → delta → decision
68
+
69
+ continue / switch model / stop
70
+ ```
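The four steps above can be sketched as a single decision function. This is a minimal illustration, not noepicycle's actual implementation: the name `decide`, its arguments, and the rule that a score of 1.0 means "solved" are assumptions made for the example; only the default values 0.02 and 2 come from the thresholds table further down.

```python
from typing import List


def decide(
    scores: List[float],
    budget_left: int,
    rungs_left: int,
    delta_threshold: float = 0.02,
    plateau_window: int = 2,
) -> str:
    """Return "continue", "switch", or "stop" after the latest iteration.

    Hypothetical helper: names and exact rules are assumptions for illustration.
    """
    if scores and scores[-1] >= 1.0:
        return "stop"  # solved
    if budget_left <= 0:
        return "stop"  # budget exhausted

    # Improvement delta between each pair of consecutive iterations.
    deltas = [later - earlier for earlier, later in zip(scores, scores[1:])]
    regressed = bool(deltas) and deltas[-1] < 0
    recent = deltas[-plateau_window:]
    plateaued = len(recent) == plateau_window and all(
        d < delta_threshold for d in recent
    )

    if regressed or plateaued:
        # Switch while another ladder rung is available, otherwise stop.
        return "switch" if rungs_left > 0 else "stop"
    return "continue"


print(decide([0.2, 0.6], budget_left=20_000, rungs_left=2))
print(decide([0.5, 0.51, 0.515], budget_left=20_000, rungs_left=2))
```

The first call sees a large gain and keeps going; the second sees two consecutive small deltas and asks for a model switch.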
71
+
72
+ ## Benchmark results (v0.1.0)
73
+
74
+ Evaluated across 5 hard coding tasks (CSV parsing, LRU cache, rate limiting,
75
+ expression evaluation, event emitter), 3 runs per condition, compared against
76
+ a fixed 5-iteration flat baseline using Claude Haiku.
77
+
78
+ | Task | noepicycle tokens | flat tokens | savings | accuracy |
79
+ |---|---|---|---|---|
80
+ | csv_parser | 624 ± 43 | 1,165 | 46% | 100% vs 100% |
81
+ | lru_cache | 946 ± 318 | 1,705 | 45% | 100% vs 100% |
82
+ | expression_evaluator | 2,034 ± 979 | 2,759 | 26% | **100% vs 96%** |
83
+ | event_emitter | 824 ± 53 | 1,362 | 40% | 100% vs 100% |
84
+ | hash_ring | 573 ± 225 | 578 | ~0% | 100% vs 100% |
85
+
86
+ **32% mean token savings across tasks where the supervisor's early-stopping
87
+ signal fired. On expression_evaluator, noepicycle also achieved higher
88
+ accuracy than the flat baseline (100% vs 96%) at lower cost.**
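The savings column can be recomputed from the two token columns as `1 - noepicycle / flat`. The sketch below does that arithmetic with the mean token counts from the table. It ignores the ± spreads, and averaging over all five rows is an assumption about how the headline figure was formed, not something the README states.

```python
# Mean token counts copied from the benchmark table: (noepicycle, flat baseline).
rows = {
    "csv_parser": (624, 1165),
    "lru_cache": (946, 1705),
    "expression_evaluator": (2034, 2759),
    "event_emitter": (824, 1362),
    "hash_ring": (573, 578),
}

# Fractional saving per task: share of the flat baseline's tokens not spent.
savings = {task: 1 - used / flat for task, (used, flat) in rows.items()}

for task, saving in savings.items():
    print(f"{task}: {saving:.1%}")

# Unweighted mean over all five tasks (assumed aggregation).
mean_all = sum(savings.values()) / len(savings)
print(f"mean over all tasks: {mean_all:.1%}")
```

Under that assumption the five-row mean lands between 31% and 32%, close to the headline figure.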
89
+
90
+ ## Installation
91
+
92
+ pip install noepicycle
93
+
94
+ Requires Docker Desktop running (for sandboxed code execution).
95
+ Get Docker at https://docker.com/get-started
96
+
97
+ Set your Anthropic API key:
98
+ export ANTHROPIC_API_KEY=sk-ant-...
99
+ # or create a .env file in your project directory
100
+
101
+ ## Quickstart
102
+
103
+ **CLI (no setup required beyond API key + Docker):**
104
+
105
+ # write a test file
106
+ echo 'from solution import f
107
+ def test_basic():
108
+     assert f([1,2,3]) == [3,2,1]' > tests.py
109
+
110
+ # run noepicycle
111
+ noepi run "Write a Python function f that reverses a list" \
112
+ --tests tests.py --budget 30000
113
+
114
+ **Python API:**
115
+
116
+ from noepicycle import Supervisor
117
+
118
+ supervisor = Supervisor(
119
+ test_code=open("tests.py").read(),
120
+ budget_cap=30_000,
121
+ )
122
+ result = supervisor.run("Write a Python function f that reverses a list")
123
+ print(result.solution)
124
+ print(f"Score: {result.score:.0%}, Tokens: {result.tokens_spent}")
125
+
126
+ ## Plug-and-play interface
127
+
128
+ noepicycle wraps your loop, not the other way around, although it has a
129
+ canonical inner loop built in. You provide:
130
+ 1. A **scoring callback** — any function that takes a solution string
131
+ and returns a float (0.0–1.0)
132
+ 2. A **budget cap** — in tokens
133
+ 3. Optionally, a **ladder config** — which models to use and when
134
+
135
+ ```python
136
+ from noepicycle import Supervisor
137
+
138
+ supervisor = Supervisor(
139
+ score_fn=lambda solution: run_tests(solution), # your scorer
140
+ budget_cap=50_000, # tokens
141
+ # ladder="default" # or pass your own
142
+ )
143
+
144
+ result = supervisor.run(
145
+ task="Fix the failing authentication middleware",
146
+ inner_loop=your_langgraph_agent, # your existing agent
147
+ )
148
+
149
+ print(result.solution) # best solution found
150
+ print(result.iterations) # how many inner loop iterations ran
151
+ print(result.tokens_spent) # total tokens used
152
+ print(result.stop_reason) # "solved" | "plateau" | "budget" | "exhausted"
153
+ ```
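The example above leaves `run_tests` undefined. One possible shape for such a scoring callback is sketched here, assuming the solution defines a function named `f` and that the score is the fraction of input/output cases that pass. `make_score_fn` and its case format are hypothetical, and `exec` runs the candidate code in-process, which is unsafe for untrusted output; noepicycle's own executor is described as Docker-sandboxed.

```python
from typing import Callable, List, Tuple


def make_score_fn(
    cases: List[Tuple[tuple, object]], func_name: str = "f"
) -> Callable[[str], float]:
    """Build a scorer mapping a solution string to a float in [0.0, 1.0].

    Hypothetical helper for illustration; not part of noepicycle.
    """

    def score(solution: str) -> float:
        namespace: dict = {}
        try:
            # WARNING: executes the candidate in-process; sandbox this in practice.
            exec(solution, namespace)
            func = namespace[func_name]
        except Exception:
            return 0.0  # does not compile, or does not define the function
        passed = 0
        for args, expected in cases:
            try:
                if func(*args) == expected:
                    passed += 1
            except Exception:
                pass  # a crashing case counts as a failure
        return passed / len(cases) if cases else 0.0

    return score


run_tests = make_score_fn([(([1, 2, 3],), [3, 2, 1]), (([],), [])])
print(run_tests("def f(xs):\n    return xs[::-1]"))
```

Returning a fraction rather than pass/fail gives the supervisor a graded signal, which matters for the plateau detection described later.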
154
+
155
+ ## The ladder
156
+
157
+ The supervisor navigates a model ladder — a directed graph of models
158
+ with transition rules for different signal types:
159
+
160
+ ```
161
+ Signal types:
162
+ on_plateau → improvement delta < threshold for N consecutive iterations
163
+ on_regression → score decreased from prior iteration (iteration > 0 only)
164
+ on_budget_low → remaining budget below configured threshold
165
+
166
+ Default ladder (Claude-only, privacy-safe):
167
+
168
+ claude-haiku-4-5
169
+ on_plateau → claude-sonnet-4-6
170
+ on_regression → claude-sonnet-4-6
171
+ on_budget_low → stop
172
+
173
+ claude-sonnet-4-6
174
+ on_plateau → claude-opus-4-8
175
+ on_regression → claude-opus-4-8
176
+ on_budget_low → claude-haiku-4-5
177
+
178
+ claude-opus-4-8
179
+ on_plateau → stop (terminal)
180
+ on_regression → stop
181
+ on_budget_low → claude-sonnet-4-6
182
+ ```
183
+
184
+ Bring your own ladder:
185
+
186
+ ```python
187
+ from noepicycle import Supervisor, Ladder
188
+
189
+ my_ladder = Ladder({
190
+ "claude-haiku-4-5": {
191
+ "on_plateau": "claude-sonnet-4-6",
192
+ "on_regression": "claude-opus-4-8",
193
+ "on_budget_low": None,
194
+ },
195
+ # ... more rungs
196
+ })
197
+
198
+ supervisor = Supervisor(score_fn=..., budget_cap=..., ladder=my_ladder)
199
+ ```
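To make the transition rules concrete, the default ladder listed above can be written as a plain dictionary and walked by hand. This is only an illustrative sketch: `next_model` and `walk` are hypothetical helpers, and treating `None` as "stop" is an assumption based on the `"on_budget_low": None` entry in the custom ladder example.

```python
from typing import Dict, List, Optional

Ladder = Dict[str, Dict[str, Optional[str]]]

# Default ladder from the listing above; None stands for "stop" (assumption).
DEFAULT_LADDER: Ladder = {
    "claude-haiku-4-5": {
        "on_plateau": "claude-sonnet-4-6",
        "on_regression": "claude-sonnet-4-6",
        "on_budget_low": None,
    },
    "claude-sonnet-4-6": {
        "on_plateau": "claude-opus-4-8",
        "on_regression": "claude-opus-4-8",
        "on_budget_low": "claude-haiku-4-5",
    },
    "claude-opus-4-8": {
        "on_plateau": None,
        "on_regression": None,
        "on_budget_low": "claude-sonnet-4-6",
    },
}


def next_model(ladder: Ladder, current: str, signal: str) -> Optional[str]:
    """Look up the rung to move to for a signal, or None to stop."""
    return ladder[current][signal]


def walk(ladder: Ladder, start: str, signals: List[str]) -> List[str]:
    """Follow a sequence of signals and return the models visited."""
    path = [start]
    for signal in signals:
        target = next_model(ladder, path[-1], signal)
        if target is None:
            break  # terminal transition
        path.append(target)
    return path


print(walk(DEFAULT_LADDER, "claude-haiku-4-5", ["on_plateau", "on_regression", "on_plateau"]))
```

The walk climbs from Haiku to Sonnet to Opus and then stops, because Opus has no further rung for a plateau.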
200
+
201
+ ## Context transfer on model switch
202
+
203
+ When switching models, noepicycle does not pass the full failure
204
+ history to the new model — that would re-introduce the self-conditioning
205
+ problem. Instead, it runs a cheap summarization call (haiku) to produce
206
+ a clean briefing:
207
+
208
+ ```
209
+ Task: [original task]
210
+ Best solution so far: [current best]
211
+ Approaches tried: [summary of what failed and why]
212
+ ```
213
+
214
+ This breaks the self-conditioning cycle while preserving useful signal.
215
+ Configurable:
216
+
217
+ ```python
218
+ Supervisor(..., context_transfer="summary") # default
219
+ Supervisor(..., context_transfer="reset") # task + best only, nothing else
220
+ Supervisor(..., context_transfer="full") # full history (not recommended)
221
+ ```
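The three modes differ only in how much history reaches the next model. A rough sketch of that difference follows, assuming the briefing is plain text in the format shown above. `build_briefing` and its `summarize` hook are hypothetical; in noepicycle the summary is described as coming from a cheap model call, which a simple placeholder stands in for here.

```python
from typing import Callable, List, Optional


def build_briefing(
    task: str,
    best_solution: str,
    history: List[str],
    mode: str = "summary",
    summarize: Optional[Callable[[List[str]], str]] = None,
) -> str:
    """Assemble the text handed to the next model (illustrative only)."""
    header = [f"Task: {task}", f"Best solution so far: {best_solution}"]
    if mode == "reset":
        # Task and best solution only.
        return "\n".join(header)
    if mode == "summary":
        # Placeholder summarizer; a real one would call a cheap model.
        summarizer = summarize or (
            lambda attempts: f"{len(attempts)} earlier attempt(s) failed"
        )
        return "\n".join(header + [f"Approaches tried: {summarizer(history)}"])
    if mode == "full":
        # Entire failure history, verbatim.
        return "\n".join(header + history)
    raise ValueError(f"unknown context_transfer mode: {mode!r}")


print(build_briefing("Reverse a list", "def f(xs): return xs", ["attempt 1", "attempt 2"]))
```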
222
+
223
+ ## Stopping thresholds
224
+
225
+ Default thresholds are derived from published research:
226
+
227
+ | Parameter | Default | Source |
228
+ |---|---|---|
229
+ | delta_threshold | 0.02 | arXiv:2603.27440 |
230
+ | plateau_window | 2 | arXiv:2509.06770 |
231
+ | budget_low_pct | 0.25 | noepicycle default |
232
+ | budget_stop_pct | 0.05 | noepicycle default |
233
+ | grace_period | 2 | noepicycle default |
234
+
235
+ All configurable:
236
+
237
+ ```python
238
+ Supervisor(
239
+ ...,
240
+ delta_threshold=0.05, # treat gains under 0.05 as a plateau (default 0.02)
241
+ plateau_window=3, # require more evidence before switching
242
+ grace_period=3, # more cycles before reassessing after switch
243
+ )
244
+ ```
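The two budget parameters can be read as fractions of the budget cap that remain unspent. The sketch below follows that reading, which is an assumption: `budget_signal` is a hypothetical helper, and the exact comparison rules inside noepicycle may differ.

```python
def budget_signal(
    budget_cap: int,
    tokens_spent: int,
    budget_low_pct: float = 0.25,
    budget_stop_pct: float = 0.05,
) -> str:
    """Classify the remaining budget as "ok", "on_budget_low", or "stop".

    Assumes both thresholds are fractions of budget_cap still unspent.
    """
    remaining = max(budget_cap - tokens_spent, 0) / budget_cap
    if remaining <= budget_stop_pct:
        return "stop"
    if remaining < budget_low_pct:
        return "on_budget_low"
    return "ok"


for spent in (10_000, 24_000, 29_000):
    print(spent, budget_signal(30_000, spent))
```

With a 30,000-token cap, spending 24,000 tokens leaves 20% and raises `on_budget_low`, while spending 29,000 leaves about 3% and stops the run.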
245
+
246
+ ## Low test coverage warning
247
+
248
+ noepicycle's convergence detection is most reliable with 5+ objective
249
+ test cases. With fewer tests, the scoring signal is coarser and deltas
250
+ noisier. If your scorer returns only binary values (0.0 or 1.0),
251
+ noepicycle will warn and widen the plateau_window automatically:
252
+
253
+ ```
254
+ ⚠ noepicycle: Binary scoring signal detected (possible low test
255
+ coverage). Widening plateau_window from 2 to 4 for more reliable
256
+ convergence detection. Consider adding more tests or using
257
+ score_fn="llm_judge" for a smoother signal.
258
+ ```
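A binary signal can be detected by checking whether every observed score is exactly 0.0 or 1.0. The following sketch shows one way to do that and to widen the window. The function name, the minimum of two observations, and the doubling factor are assumptions; the README only shows the window going from 2 to 4.

```python
from typing import List


def effective_plateau_window(
    scores: List[float], plateau_window: int = 2, widen_factor: int = 2
) -> int:
    """Return a wider window when the scorer only ever emits 0.0 or 1.0.

    Hypothetical helper; widen_factor=2 is inferred from the 2 -> 4 example.
    """
    is_binary = len(scores) >= 2 and all(s in (0.0, 1.0) for s in scores)
    return plateau_window * widen_factor if is_binary else plateau_window


print(effective_plateau_window([0.0, 0.0, 1.0]))
print(effective_plateau_window([0.0, 0.4, 0.6]))
```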
259
+
260
+ ## Privacy
261
+
262
+ noepicycle's default ladder uses Anthropic's Claude models only.
263
+ Anthropic's enterprise data handling policies apply.
264
+
265
+ An optional performance ladder including DeepSeek models is available
266
+ but **opt-in only**. DeepSeek is subject to Chinese data jurisdiction
267
+ and should not be used with proprietary, sensitive, or regulated code:
268
+
269
+ ```python
270
+ # opt-in explicitly — do not use with sensitive codebases
271
+ from noepicycle.ladders import PERFORMANCE_LADDER
272
+ supervisor = Supervisor(..., ladder=PERFORMANCE_LADDER)
273
+ ```
274
+
275
+ See [PRIVACY.md](./PRIVACY.md) for data handling details for each
276
+ supported model family.
277
+
278
+ ## Research foundation
279
+
280
+ noepicycle's design is grounded in published findings:
281
+
282
+ **On diminishing returns:**
283
+ - Williams (2026): Rounds 1-2 capture 75% of reachable improvement
284
+ in LLM refinement loops. [LLM Verification Loops, Medium]
285
+ - REA-Coder (2026): Improvement from iterations 1-5 averages 12.18%;
286
+ iterations 5-10 yield only 6% additional gain. [arXiv:2604.16198]
287
+ - KubeGuard (2025): Oscillation with diminishing returns emerges
288
+ beyond iteration 3-4; early stopping readily mitigates it.
289
+ [arXiv:2509.04191]
290
+
291
+ **On self-conditioning (why loops get worse, not just flat):**
292
+ - "The Illusion of Diminishing Returns" (2026): Models condition on
293
+ their own prior mistakes, increasing future error rates beyond
294
+ long-context effects alone. [arXiv:2509.09677]
295
+ - "Contextual Drag" (2026): Iterative refinement collapses accuracy
296
+ for models with contextual drag; independent samples improve
297
+ steadily. [arXiv:2602.04288]
298
+
299
+ **On context rot:**
300
+ - Stanford/Redis (2025): Accuracy drops 15-20 percentage points with
301
+ ~4,000 tokens of accumulated context. [redis.io/blog/context-rot]
302
+
303
+ **On coding loops specifically:**
304
+ - "Another Turn, Better Output?" (2025): Coding benefits from early
305
+ decision and restraint. If a correct path does not appear quickly,
306
+ stop or restart — do not push vague refinement. [arXiv:2509.06770]
307
+
308
+ **On default thresholds:**
309
+ - delta_threshold=0.02: If two consecutive iterations show delta <
310
+ 0.02, additional refinement is unlikely to help. [arXiv:2603.27440]
311
+
312
+ **On model selection:**
313
+ - Harness/scaffolding moves SWE-bench results by 17-21 points — more
314
+ than model swaps alone. noepicycle changes both model and prompting
315
+ strategy on switch. [morphllm.com/best-ai-model-for-coding]
316
+ - SWE-bench Verified figures for the default ladder: Opus 4.8 (88.6%),
317
+ Sonnet 4.6 (79.6%); for Haiku 4.5 the cited figure is a cost, ~$0.13 per solved task.
318
+ [morphllm.com/best-ai-model-for-coding]
319
+
320
+ ## Status
321
+
322
+ - [x] README / research foundation
323
+ - [x] Core state schema with runtime evidence capture
324
+ - [x] Default model ladder (Claude-only + opt-in DeepSeek)
325
+ - [x] LangGraph supervisor graph
326
+ - [x] Docker-sandboxed executor with intermediate variable tracing
327
+ - [x] CLI (noepi run / --dry-run / check)
328
+ - [x] Preflight single-shot gate (avoids loop overhead on easy tasks)
329
+ - [x] Evaluation harness (3-run benchmark vs flat baseline)
330
+ - [x] PyPI package (pip install noepicycle)
331
+ - [ ] Loop Library submission
332
+ - [ ] MCP server for Claude Code integration
333
+ - [ ] Direction injection + constraint extraction (v1.5)
334
+ - [ ] Topology learning from run logs
335
+ - [ ] Loop-aware inner agent
336
+
337
+ ## Known limitations (v0.1.0)
338
+
339
+ - High variance on tasks where the model produces slightly different
340
+ incorrect solutions each iteration — fixation detection requires
341
+ identical solution hashes, so near-identical broken solutions don't
342
+ trigger a switch as early as they should. Fix planned for v0.2.0.
343
+ - Tested on Claude Haiku only for the default inner loop. Sonnet/Opus
344
+ as inner loop models not yet benchmarked.
345
+ - Windows path handling in Docker executor may require Docker Desktop
346
+ running in Linux container mode.
347
+
348
+ ## Contributing
349
+
350
+ Issues and PRs welcome.
351
+
352
+ If you run noepicycle on a task and get interesting results (especially
353
+ cases where the supervisor made a wrong call), open an issue with your
354
+ `eval/results.json` — real usage data helps improve the default ladder
355
+ and stopping thresholds.
356
+
357
+ ## License
358
+
359
+ MIT