noepicycle 0.1.1__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- noepicycle-0.1.1/.github/workflows/publish.yml +29 -0
- noepicycle-0.1.1/.gitignore +20 -0
- noepicycle-0.1.1/PKG-INFO +359 -0
- noepicycle-0.1.1/README.md +341 -0
- noepicycle-0.1.1/eval/benchmark.py +705 -0
- noepicycle-0.1.1/eval/results.json +482 -0
- noepicycle-0.1.1/eval_results/v0.1_4_7_26 +111 -0
- noepicycle-0.1.1/eval_results/v0.1_4_7_26_only_loopable_tasks +88 -0
- noepicycle-0.1.1/eval_results/v0.1_4_7_26_only_loopable_tasks_3_runs +179 -0
- noepicycle-0.1.1/eval_results/v0.2_full_benchmark.txt +263 -0
- noepicycle-0.1.1/eval_results/v0.2_full_benchmark_4_7_2026.txt +261 -0
- noepicycle-0.1.1/pyproject.toml +38 -0
- noepicycle-0.1.1/src/noepicycle/__init__.py +5 -0
- noepicycle-0.1.1/src/noepicycle/cli.py +267 -0
- noepicycle-0.1.1/src/noepicycle/default_loop.py +64 -0
- noepicycle-0.1.1/src/noepicycle/executor.py +329 -0
- noepicycle-0.1.1/src/noepicycle/graph.py +153 -0
- noepicycle-0.1.1/src/noepicycle/ladder.py +155 -0
- noepicycle-0.1.1/src/noepicycle/nodes.py +378 -0
- noepicycle-0.1.1/src/noepicycle/state.py +192 -0
- noepicycle-0.1.1/tests_reverse.py +16 -0
@@ -0,0 +1,29 @@
name: Publish to PyPI

on:
  push:
    branches:
      - main

jobs:
  publish:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"

      - name: Install build tools
        run: pip install build twine

      - name: Build package
        run: python -m build

      - name: Publish to PyPI
        env:
          TWINE_USERNAME: __token__
          TWINE_PASSWORD: ${{ secrets.PYPI_TOKEN }}
        run: twine upload dist/* --skip-existing
@@ -0,0 +1,359 @@
Metadata-Version: 2.4
Name: noepicycle
Version: 0.1.1
Summary: A LangGraph supervisor that prevents agentic coding loops from going in circles and burning tokens
License: MIT
Requires-Python: >=3.11
Requires-Dist: anthropic>=0.25.0
Requires-Dist: langchain-anthropic>=0.1.0
Requires-Dist: langgraph>=0.2.0
Requires-Dist: python-dotenv>=1.0.0
Requires-Dist: rich>=13.0.0
Requires-Dist: typer>=0.12.0
Provides-Extra: dev
Requires-Dist: pytest-asyncio; extra == 'dev'
Requires-Dist: pytest>=8.0.0; extra == 'dev'
Requires-Dist: ruff; extra == 'dev'
Description-Content-Type: text/markdown

# noepicycle

As [Michael Shimeles](https://www.youtube.com/watch?v=7clJ8IH784Q) has pointed out, unless you work at Anthropic and have unlimited token access, agentic coding loops are prohibitively expensive. noepicycle is a LangGraph-native supervisor agent that keeps token costs from spiralling: it detects diminishing
returns in real time and either switches models dynamically or stops cleanly
once it's clear the loop is burning through the token budget without improving.

## The problem

Most agentic coding loops run until they hit a fixed iteration cap or
exhaust a token budget. But research shows that **rounds 1-2 capture
75% of reachable improvement** in LLM refinement loops [Williams, 2026].
Much of what comes after that is diminishing returns or worse, since continuing
to loop is actively harmful to code quality:

- **Self-conditioning**: models condition on their own prior failures,
  making future mistakes *more* likely, not less [arXiv:2509.09677]
- **Context rot**: each iteration appends more context, degrading
  performance by 15-20 percentage points by the time you've accumulated
  ~4,000 tokens of history [Stanford/Redis, 2025]
- **Oscillation**: loops are less likely to plateau cleanly than to
  oscillate, making a single-iteration dip look like recovery when
  it isn't [arXiv:2509.04191]

## How it works

noepicycle runs as an **outer LangGraph supervisor** over your existing
inner coding loop (there's also a bare-bones inner loop provided).
After each iteration, it:

1. Scores the current solution against your scoring function
2. Computes the improvement delta vs. prior iterations
3. Checks the delta trend against research-backed thresholds
4. Makes one of three decisions:
   - **Continue** — improvement is real, keep going
   - **Switch** — plateau or regression detected; switch to a different
     model with a clean context summary, breaking the self-conditioning
     cycle
   - **Stop** — solved, budget exhausted, or all ladder rungs tried

Model switching follows a **user-configurable ladder** — a ranked
graph of models with explicit transition rules for plateau, regression,
and budget-low events. The default ladder alternates between model
families to break reasoning ruts.

```
[Your inner loop agent]
        ↑ ↓  (one iteration)
[noepicycle supervisor]
        → score → delta → decision
        ↓
continue / switch model / stop
```
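
The score → delta → decision step above can be sketched in a few lines. This is an illustrative reading of the four steps, not noepicycle's actual implementation: `Thresholds` and `decide` are hypothetical names, the defaults are taken from the thresholds table in this README, and the "all ladder rungs tried" stop condition is left out for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    # Defaults mirror the README's "Stopping thresholds" table.
    delta_threshold: float = 0.02
    plateau_window: int = 2


def decide(scores: list[float], remaining_budget: int,
           t: Thresholds = Thresholds()) -> str:
    """Hypothetical helper: map a score history to continue / switch / stop."""
    if scores and scores[-1] >= 1.0:
        return "stop"          # solved
    if remaining_budget <= 0:
        return "stop"          # budget exhausted
    if len(scores) < 2:
        return "continue"      # no delta to inspect yet

    # Improvement delta between consecutive iterations.
    deltas = [b - a for a, b in zip(scores, scores[1:])]
    if deltas[-1] < 0:
        return "switch"        # regression vs. the prior iteration

    recent = deltas[-t.plateau_window:]
    if len(recent) == t.plateau_window and all(d < t.delta_threshold for d in recent):
        return "switch"        # plateau: too little gain for N iterations in a row
    return "continue"


print(decide([0.2, 0.6], remaining_budget=10_000))
print(decide([0.5, 0.5, 0.5], remaining_budget=10_000))
```

Keeping the decision a pure function of the score history makes the thresholds easy to test in isolation from any model calls.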

## Benchmark results (v0.1.0)

Evaluated across 5 hard coding tasks (CSV parsing, LRU cache, rate limiting,
expression evaluation, event emitter), 3 runs per condition, compared against
a fixed 5-iteration flat baseline using Claude Haiku.

| Task | noepicycle tokens | flat tokens | savings | accuracy |
|---|---|---|---|---|
| csv_parser | 624 ± 43 | 1,165 | 46% | 100% vs 100% |
| lru_cache | 946 ± 318 | 1,705 | 45% | 100% vs 100% |
| expression_evaluator | 2,034 ± 979 | 2,759 | 26% | **100% vs 96%** |
| event_emitter | 824 ± 53 | 1,362 | 40% | 100% vs 100% |
| hash_ring | 573 ± 225 | 578 | ~0% | 100% vs 100% |

**32% mean token savings across tasks where the supervisor's early-stopping
signal fired. On expression_evaluator, noepicycle also achieved higher
accuracy than the flat baseline (100% vs 96%) at lower cost.**
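
For readers who want to re-derive the savings column, the sketch below assumes savings are computed as `1 - noepicycle_tokens / flat_tokens` on the mean token counts. That formula is an assumption inferred from the table, not taken from the benchmark code. The numbers are copied from the rows above.

```python
# Mean token counts from the table: (noepicycle, flat baseline).
results = {
    "csv_parser": (624, 1165),
    "lru_cache": (946, 1705),
    "expression_evaluator": (2034, 2759),
    "event_emitter": (824, 1362),
    "hash_ring": (573, 578),
}


def savings(supervised_tokens: float, flat_tokens: float) -> float:
    """Fraction of tokens saved relative to the flat baseline (assumed formula)."""
    return 1.0 - supervised_tokens / flat_tokens


per_task = {task: savings(s, f) for task, (s, f) in results.items()}
for task, frac in per_task.items():
    print(f"{task}: {frac:.1%}")

# Unweighted mean across the listed tasks.
mean_savings = sum(per_task.values()) / len(per_task)
print(f"mean: {mean_savings:.1%}")
```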

## Installation

    pip install noepicycle

Requires Docker Desktop running (for sandboxed code execution).
Get Docker at https://docker.com/get-started

Set your Anthropic API key:

    export ANTHROPIC_API_KEY=sk-ant-...
    # or create a .env file in your project directory

## Quickstart

**CLI (no setup required beyond API key + Docker):**

    # write a test file
    echo 'from solution import f
    def test_basic():
        assert f([1,2,3]) == [3,2,1]' > tests.py

    # run noepicycle
    noepi run "Write a Python function f that reverses a list" \
      --tests tests.py --budget 30000

**Python API:**

    from noepicycle import Supervisor

    supervisor = Supervisor(
        test_code=open("tests.py").read(),
        budget_cap=30_000,
    )
    result = supervisor.run("Write a Python function f that reverses a list")
    print(result.solution)
    print(f"Score: {result.score:.0%}, Tokens: {result.tokens_spent}")

## Plug-and-play interface

noepicycle wraps your loop, not the other way around, although it has a
canonical inner loop built in. You provide:
1. A **scoring callback** — any function that takes a solution string
   and returns a float (0.0–1.0)
2. A **budget cap** — in tokens
3. Optionally, a **ladder config** — which models to use and when

```python
from noepicycle import Supervisor

supervisor = Supervisor(
    score_fn=lambda solution: run_tests(solution),  # your scorer
    budget_cap=50_000,                              # tokens
    # ladder="default"                              # or pass your own
)

result = supervisor.run(
    task="Fix the failing authentication middleware",
    inner_loop=your_langgraph_agent,                # your existing agent
)

print(result.solution)      # best solution found
print(result.iterations)    # how many inner loop iterations ran
print(result.tokens_spent)  # total tokens used
print(result.stop_reason)   # "solved" | "plateau" | "budget" | "exhausted"
```
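
The example above leaves `run_tests` undefined. As a rough illustration of what a scoring callback could look like, the sketch below returns the fraction of input/expected pairs a solution passes. `make_score_fn` and its parameters are hypothetical names, and it uses a plain in-process `exec` purely for illustration; it is unsandboxed, unlike the Docker-based execution this README describes, so do not run untrusted code through it.

```python
from typing import Callable


def make_score_fn(cases: list[tuple[list, list]],
                  func_name: str = "f") -> Callable[[str], float]:
    """Build a scorer mapping a solution string to a score in [0.0, 1.0]."""

    def score(solution: str) -> float:
        namespace: dict = {}
        try:
            exec(solution, namespace)      # illustration only: NOT sandboxed
            func = namespace[func_name]
        except Exception:
            return 0.0                     # does not compile or lacks the function

        passed = 0
        for arg, expected in cases:
            try:
                if func(arg) == expected:
                    passed += 1
            except Exception:
                pass                       # a crashing case counts as a failure
        return passed / len(cases) if cases else 0.0

    return score


score_fn = make_score_fn([([1, 2, 3], [3, 2, 1]), ([], []), ([7], [7])])
print(score_fn("def f(xs): return xs[::-1]"))
```

A fractional score like this gives the supervisor a finer-grained delta than a pass/fail result, which matters for the plateau detection described later.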

## The ladder

The supervisor navigates a model ladder — a directed graph of models
with transition rules for different signal types:

```
Signal types:
  on_plateau     → improvement delta < threshold for N consecutive iterations
  on_regression  → score decreased from prior iteration (iteration > 0 only)
  on_budget_low  → remaining budget below configured threshold

Default ladder (Claude-only, privacy-safe):

  claude-haiku-4-5
    on_plateau     → claude-sonnet-4-6
    on_regression  → claude-sonnet-4-6
    on_budget_low  → stop

  claude-sonnet-4-6
    on_plateau     → claude-opus-4-8
    on_regression  → claude-opus-4-8
    on_budget_low  → claude-haiku-4-5

  claude-opus-4-8
    on_plateau     → stop (terminal)
    on_regression  → stop
    on_budget_low  → claude-sonnet-4-6
```

Bring your own ladder:

```python
from noepicycle import Supervisor, Ladder

my_ladder = Ladder({
    "claude-haiku-4-5": {
        "on_plateau": "claude-sonnet-4-6",
        "on_regression": "claude-opus-4-8",
        "on_budget_low": None,
    },
    # ... more rungs
})

supervisor = Supervisor(score_fn=..., budget_cap=..., ladder=my_ladder)
```
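
To make the transition rules concrete, here is a minimal sketch that encodes the default ladder as a plain dictionary and walks it for a sequence of signals. It deliberately avoids the `Ladder` class, whose internals are not shown here. `DEFAULT_LADDER`, `next_model`, and `walk` are hypothetical names, and `None` is assumed to mean "stop", as in the custom-ladder example above.

```python
from typing import Optional

# The default ladder from the README, as data. None is assumed to mean "stop".
DEFAULT_LADDER: dict[str, dict[str, Optional[str]]] = {
    "claude-haiku-4-5": {
        "on_plateau": "claude-sonnet-4-6",
        "on_regression": "claude-sonnet-4-6",
        "on_budget_low": None,
    },
    "claude-sonnet-4-6": {
        "on_plateau": "claude-opus-4-8",
        "on_regression": "claude-opus-4-8",
        "on_budget_low": "claude-haiku-4-5",
    },
    "claude-opus-4-8": {
        "on_plateau": None,
        "on_regression": None,
        "on_budget_low": "claude-sonnet-4-6",
    },
}


def next_model(ladder: dict, current: str, signal: str) -> Optional[str]:
    """Return the model to switch to, or None if the ladder says stop."""
    return ladder[current].get(signal)


def walk(ladder: dict, start: str, signals: list[str]) -> list[str]:
    """Follow signals from `start`, returning the models visited until a stop."""
    visited = [start]
    current = start
    for signal in signals:
        target = next_model(ladder, current, signal)
        if target is None:
            break
        visited.append(target)
        current = target
    return visited


print(walk(DEFAULT_LADDER, "claude-haiku-4-5",
           ["on_plateau", "on_regression", "on_plateau"]))
```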

## Context transfer on model switch

When switching models, noepicycle does not pass the full failure
history to the new model — that would re-introduce the self-conditioning
problem. Instead, it runs a cheap summarization call (haiku) to produce
a clean briefing:

```
Task: [original task]
Best solution so far: [current best]
Approaches tried: [summary of what failed and why]
```

This breaks the self-conditioning cycle while preserving useful signal.
Configurable:

```python
Supervisor(..., context_transfer="summary")  # default
Supervisor(..., context_transfer="reset")    # task + best only, nothing else
Supervisor(..., context_transfer="full")     # full history (not recommended)
```
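
The three modes differ only in how much history reaches the next model. The sketch below shows one way to assemble the briefing. `build_handoff` is a hypothetical helper, the `summarize` callable stands in for the cheap summarization call, and its default (keep the last three notes) is a placeholder, not how noepicycle summarizes.

```python
from typing import Callable, Optional


def build_handoff(
    task: str,
    best_solution: str,
    history: list[str],
    mode: str = "summary",
    summarize: Optional[Callable[[list[str]], str]] = None,
) -> str:
    """Assemble the context handed to the next model (illustrative only)."""
    lines = [f"Task: {task}", f"Best solution so far: {best_solution}"]

    if mode == "reset":
        # Task + best solution only; no failure history at all.
        return "\n".join(lines)

    if mode == "summary":
        # Placeholder for the LLM summarization call: keep the last 3 notes.
        summarize = summarize or (lambda notes: "; ".join(notes[-3:]))
        lines.append(f"Approaches tried: {summarize(history)}")
        return "\n".join(lines)

    if mode == "full":
        # Full history; this is the mode the README advises against.
        lines.append("Full history:")
        lines.extend(history)
        return "\n".join(lines)

    raise ValueError(f"unknown context_transfer mode: {mode!r}")


print(build_handoff(
    task="Reverse a list",
    best_solution="def f(xs): return xs",
    history=["returned input unchanged", "sorted instead of reversing"],
))
```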

## Stopping thresholds

Default thresholds are derived from published research:

| Parameter | Default | Source |
|---|---|---|
| delta_threshold | 0.02 | arXiv:2603.27440 |
| plateau_window | 2 | arXiv:2509.06770 |
| budget_low_pct | 0.25 | noepicycle default |
| budget_stop_pct | 0.05 | noepicycle default |
| grace_period | 2 | noepicycle default |

All configurable:

```python
Supervisor(
    ...,
    delta_threshold=0.05,  # stricter plateau detection
    plateau_window=3,      # require more evidence before switching
    grace_period=3,        # more cycles before reassessing after switch
)
```
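
The budget and grace-period parameters are not spelled out above, so the following sketch states one plausible interpretation as explicit assumptions: `budget_low_pct` and `budget_stop_pct` are read as fractions of the budget cap still *remaining*, and `grace_period` as the number of iterations after a switch during which plateau checks are skipped. `budget_signal` and `in_grace_period` are hypothetical helpers.

```python
def budget_signal(tokens_spent: int, budget_cap: int,
                  budget_low_pct: float = 0.25,
                  budget_stop_pct: float = 0.05) -> str:
    """Classify the budget state as 'ok', 'on_budget_low', or 'stop'.

    Assumes both thresholds are fractions of the cap still remaining.
    """
    if budget_cap <= 0:
        raise ValueError("budget_cap must be positive")
    remaining = max(budget_cap - tokens_spent, 0) / budget_cap
    if remaining <= budget_stop_pct:
        return "stop"
    if remaining <= budget_low_pct:
        return "on_budget_low"
    return "ok"


def in_grace_period(iterations_since_switch: int, grace_period: int = 2) -> bool:
    """Assumed meaning: skip plateau checks for the first N iterations after a switch."""
    return iterations_since_switch < grace_period


for spent in (10_000, 25_000, 29_500):
    print(spent, budget_signal(spent, budget_cap=30_000))
```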

## Low test coverage warning

noepicycle's convergence detection is most reliable with 5+ objective
test cases. With fewer tests, the scoring signal is coarser and deltas
noisier. If your scorer returns only binary values (0.0 or 1.0),
noepicycle will warn and widen the plateau_window automatically:

```
⚠ noepicycle: Binary scoring signal detected (possible low test
coverage). Widening plateau_window from 2 to 4 for more reliable
convergence detection. Consider adding more tests or using
score_fn="llm_judge" for a smoother signal.
```
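
A minimal sketch of the binary-signal check, assuming the widening rule is a simple doubling (inferred from "from 2 to 4" in the warning; the real rule may differ). `is_binary_signal` and `effective_plateau_window` are hypothetical names.

```python
def is_binary_signal(scores: list[float]) -> bool:
    """True if every observed score is exactly 0.0 or 1.0."""
    return len(scores) > 0 and all(s in (0.0, 1.0) for s in scores)


def effective_plateau_window(scores: list[float],
                             plateau_window: int = 2,
                             widen_factor: int = 2) -> int:
    """Widen the window when the scorer looks binary (assumed: doubling)."""
    if is_binary_signal(scores):
        return plateau_window * widen_factor
    return plateau_window


print(effective_plateau_window([0.0, 0.0, 1.0]))   # binary scores
print(effective_plateau_window([0.0, 0.4, 0.8]))   # graded scores
```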

## Privacy

noepicycle's default ladder uses Anthropic's Claude models only.
Anthropic's enterprise data handling policies apply.

An optional performance ladder including DeepSeek models is available
but **opt-in only**. DeepSeek is subject to Chinese data jurisdiction
and should not be used with proprietary, sensitive, or regulated code:

```python
# opt-in explicitly — do not use with sensitive codebases
from noepicycle.ladders import PERFORMANCE_LADDER
supervisor = Supervisor(..., ladder=PERFORMANCE_LADDER)
```

See [PRIVACY.md](./PRIVACY.md) for data handling details for each
supported model family.

## Research foundation

noepicycle's design is grounded in published findings:

**On diminishing returns:**
- Williams (2026): Rounds 1-2 capture 75% of reachable improvement
  in LLM refinement loops. [LLM Verification Loops, Medium]
- REA-Coder (2026): Improvement from iterations 1-5 averages 12.18%;
  iterations 5-10 yield only 6% additional gain. [arXiv:2604.16198]
- KubeGuard (2025): Oscillation with diminishing returns emerges
  beyond iteration 3-4; early stopping readily mitigates it.
  [arXiv:2509.04191]

**On self-conditioning (why loops get worse, not just flat):**
- "The Illusion of Diminishing Returns" (2026): Models condition on
  their own prior mistakes, increasing future error rates beyond
  long-context effects alone. [arXiv:2509.09677]
- "Contextual Drag" (2026): Iterative refinement collapses accuracy
  for models with contextual drag; independent samples improve
  steadily. [arXiv:2602.04288]

**On context rot:**
- Stanford/Redis (2025): Accuracy drops 15-20 percentage points with
  ~4,000 tokens of accumulated context. [redis.io/blog/context-rot]

**On coding loops specifically:**
- "Another Turn, Better Output?" (2025): Coding benefits from early
  decision and restraint. If a correct path does not appear quickly,
  stop or restart — do not push vague refinement. [arXiv:2509.06770]

**On default thresholds:**
- delta_threshold=0.02: If two consecutive iterations show delta <
  0.02, additional refinement is unlikely to help. [arXiv:2603.27440]

**On model selection:**
- Harness/scaffolding moves SWE-bench results by 17-21 points — more
  than model swaps alone. noepicycle changes both model and prompting
  strategy on switch. [morphllm.com/best-ai-model-for-coding]
- SWE-bench Verified scores for default ladder: Opus 4.8 (88.6%),
  Sonnet 4.6 (79.6%), Haiku 4.5 (~$0.13/solved task).
  [morphllm.com/best-ai-model-for-coding]

## Status

- [x] README / research foundation
- [x] Core state schema with runtime evidence capture
- [x] Default model ladder (Claude-only + opt-in DeepSeek)
- [x] LangGraph supervisor graph
- [x] Docker-sandboxed executor with intermediate variable tracing
- [x] CLI (noepi run / --dry-run / check)
- [x] Preflight single-shot gate (avoids loop overhead on easy tasks)
- [x] Evaluation harness (3-run benchmark vs flat baseline)
- [x] PyPI package (pip install noepicycle)
- [ ] Loop Library submission
- [ ] MCP server for Claude Code integration
- [ ] Direction injection + constraint extraction (v1.5)
- [ ] Topology learning from run logs
- [ ] Loop-aware inner agent

## Known limitations (v0.1.0)

- High variance on tasks where the model produces slightly different
  incorrect solutions each iteration — fixation detection requires
  identical solution hashes, so near-identical broken solutions don't
  trigger a switch as early as they should. Fix planned for v0.2.0.
- Tested on Claude Haiku only for the default inner loop. Sonnet/Opus
  as inner loop models not yet benchmarked.
- Windows path handling in Docker executor may require Docker Desktop
  running in Linux container mode.

## Contributing

Issues and PRs welcome.

If you run noepicycle on a task and get interesting results (especially
cases where the supervisor made a wrong call), open an issue with your
`eval/results.json` — real usage data helps improve the default ladder
and stopping thresholds.

## License

MIT