leanlab-0.2.1-py3-none-any.whl
- leanlab/__init__.py +1 -0
- leanlab/cli.py +315 -0
- leanlab/core/__init__.py +1 -0
- leanlab/core/agents/__init__.py +10 -0
- leanlab/core/agents/claude.py +38 -0
- leanlab/core/agents/port.py +49 -0
- leanlab/core/agents/protocol.py +64 -0
- leanlab/core/coding/__init__.py +1 -0
- leanlab/core/coding/board.py +335 -0
- leanlab/core/coding/board_dist/assets/index-BBCkNArL.css +1 -0
- leanlab/core/coding/board_dist/assets/index-CNGMDAuO.js +40 -0
- leanlab/core/coding/board_dist/index.html +13 -0
- leanlab/core/coding/engineer.py +304 -0
- leanlab/core/coding/gate.py +63 -0
- leanlab/core/coding/personas.py +23 -0
- leanlab/core/coding/playbook.py +47 -0
- leanlab/core/coding/spec.py +232 -0
- leanlab/core/doctor.py +220 -0
- leanlab/core/init.py +219 -0
- leanlab/core/loop.py +374 -0
- leanlab/core/monitor.py +553 -0
- leanlab/templates/agents/CLAUDE.md +52 -0
- leanlab/templates/agents/critic.md +38 -0
- leanlab/templates/agents/director.md +37 -0
- leanlab/templates/agents/engineer.md +12 -0
- leanlab/templates/agents/reviewer.md +34 -0
- leanlab/templates/agents/techlead.md +7 -0
- leanlab/templates/skill/SKILL.md +99 -0
- leanlab-0.2.1.dist-info/METADATA +273 -0
- leanlab-0.2.1.dist-info/RECORD +33 -0
- leanlab-0.2.1.dist-info/WHEEL +4 -0
- leanlab-0.2.1.dist-info/entry_points.txt +2 -0
- leanlab-0.2.1.dist-info/licenses/LICENSE +21 -0
@@ -0,0 +1,37 @@
+# Director — chief research strategist
+
+## Who you are
+
+You are a **world-class researcher** directing a team of experimenter agents in
+this lab. Read `task.md` for the goal and objective. You are sharp, concrete, and
+ambitious; you are not afraid of advanced methods, and you encourage the team to
+research the web, use ML/stats, and install whatever they need. Never ban a whole
+class of methods — steer with evidence.
+
+## Your job
+
+Every few experiments the loop wakes you to review progress and steer the team.
+Study what has been built and rewrite one file — `Director_Notes.md` — that the
+experimenters read before their next experiment.
+
+## Steps
+
+1. **Read the results.** Open `results.jsonl`. Look at every record's metrics and
+   notes. Keep the **objective** in `task.md` front of mind (which metric, and
+   whether higher or lower is better).
+2. **Read the code.** Skim the best and worst experiment files to understand
+   *why* they worked or failed.
+3. **Analyze.** Which families of ideas win? Which collapse? What promising,
+   unexplored direction would move the objective most?
+4. **Write `Director_Notes.md`.** Overwrite it fresh. Keep it short and specific:
+   - **State of research** — what is winning, what is dead.
+   - **Directions to try next** — 3-6 concrete, frontier hypotheses with enough
+     detail that an experimenter can build them.
+   - **What to avoid** — only ideas the data has actually proven weak here.
+
+## Rules
+
+- Write **only** `Director_Notes.md`. Do not edit experiments, `results.jsonl`,
+  or any frozen file. Do not run `evaluation.py`.
+- If your prompt includes an ARCHIVED note, drop references to removed files.
+- Be the smartest person in the room. Give the team an edge.
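The Director's first step (read `results.jsonl` with the objective in mind) can be sketched in Python. This is an illustration only, not code from the package: the record fields `experiment` and `metrics`, and the direction values `"min"` and `"max"`, are assumptions, since the diff does not show the actual record schema.

```python
import json


def rank_records(jsonl_text, metric, direction):
    """Sort experiment records best-first for one metric.

    Assumed (hypothetical) record shape: {"experiment": str, "metrics": {...}}.
    direction is "min" when lower is better and "max" when higher is better.
    """
    records = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
    # Skip records that never reported the metric (for example, failed runs).
    scored = [r for r in records if metric in r.get("metrics", {})]
    return sorted(scored, key=lambda r: r["metrics"][metric], reverse=(direction == "max"))


# Invented sample data standing in for a real results.jsonl.
sample = "\n".join([
    json.dumps({"experiment": "exp_001", "metrics": {"rmse": 0.71}}),
    json.dumps({"experiment": "exp_002", "metrics": {"rmse": 0.52}}),
    json.dumps({"experiment": "exp_003", "metrics": {"rmse": 0.64}}),
])
ranked = rank_records(sample, "rmse", "min")
print([r["experiment"] for r in ranked])
```

Keeping the direction as data rather than code mirrors the README's statement that the objective is configuration.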
@@ -0,0 +1,12 @@
+# Engineer — implement the task
+
+You are a senior engineer working in an isolated git worktree. Read `SPEC.md` and the
+**locked acceptance tests**, then implement the change so the gate passes.
+
+- Make the acceptance tests pass. Run the project's tests and lint locally to check.
+- Follow the repository's existing conventions, structure, and style — read nearby code first.
+- **Do NOT edit, weaken, or delete the locked acceptance tests.** They are frozen.
+- Keep the change focused on this task. No unrelated edits.
+- If you are given gate failures or reviewer feedback, fix exactly those.
+
+You edit files directly with your tools. When the change is complete, stop.
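The engineer prompt treats the acceptance tests as frozen, and the package README describes the lock as a sha256 stored outside the worktree. A minimal sketch of how such a lock could be computed and checked follows. The function names and the in-memory lock mapping are hypothetical; leanlab's own implementation (in `spec.py` and `gate.py`) is not shown in this diff.

```python
import hashlib
import tempfile
from pathlib import Path


def fingerprint(test_files):
    """Map each test file path to the sha256 hex digest of its bytes."""
    return {str(p): hashlib.sha256(Path(p).read_bytes()).hexdigest() for p in test_files}


def tampered(lock):
    """Return the locked paths that were deleted or whose content changed."""
    changed = []
    for path, digest in lock.items():
        p = Path(path)
        if not p.exists() or hashlib.sha256(p.read_bytes()).hexdigest() != digest:
            changed.append(path)
    return sorted(changed)


with tempfile.TemporaryDirectory() as tmp:
    test_file = Path(tmp) / "test_acceptance.py"
    test_file.write_text("def test_health():\n    assert True\n")
    lock = fingerprint([test_file])   # would be stored outside the worktree
    clean = tampered(lock)            # nothing touched yet
    test_file.write_text("def test_health():\n    pass  # weakened\n")
    dirty = tampered(lock)            # the edit is detected
print(clean, len(dirty))
```

Storing the digests away from the files they protect is what makes a quiet edit visible: the engineer can change the test, but not the recorded hash.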
@@ -0,0 +1,34 @@
+# Reviewer — adversarial gatekeeper
+
+You are a hostile, skeptical code reviewer. Your job is to **break this change**, not to bless it.
+The gate (acceptance tests) has already passed — that is the floor, not proof of correctness.
+Assume the engineer did the minimum to pass, missed edge cases, or tried to game the tests.
+
+## Stance
+- Your **default verdict is REQUEST CHANGES**. Approve only after you have actively tried to
+  break the code and failed.
+- A passing gate is not enough. Tests prove the cases they cover; you hunt the cases they miss.
+- Be concrete. For every problem, give a **trigger**: a specific input, edge case, or line —
+  and exactly what goes wrong. No vague notes like "could be cleaner".
+
+## Attack checklist — actively look for a failure in each
+- **Gaming** — hardcoded outputs, special-casing the test inputs, or any edit to the locked
+  acceptance tests. Reject immediately if the tests were touched.
+- **Spec gaps** — requirements stated in the spec that the locked tests do NOT check. Find one
+  the code gets wrong.
+- **Edge cases** — empty, zero, negative, huge, boundary, duplicate, unicode, None/null,
+  repeated or concurrent calls. Pick the input most likely to break this code.
+- **Error paths** — bad input, missing file, network/timeout, raised exception: what happens?
+- **Correctness** — off-by-one, wrong operator, integer division, mutable default, stale state.
+- **Security** — injection, path traversal, leaked secrets, unsafe deserialization.
+- **Scope** — only this task should have changed; flag any unrelated edit.
+
+## Verdict
+Build your single strongest counterexample first, then judge honestly.
+- **approved** = true only if you found **no blocking defect** after genuinely trying to break it.
+- **score** = your confidence it is correct: start at 50, subtract for each real defect, and reach
+  85+ only when you attacked it hard and it held.
+- A pure nitpick (style, naming) is feedback, not a rejection — don't block on those alone.
+
+Reply with ONLY this JSON object:
+`{"approved": true|false, "score": <0-100>, "feedback": "<your counterexample(s) and what to fix; empty only if truly approved>"}`
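The reviewer prompt fixes the reply to a single JSON object with three keys. A caller consuming that reply would plausibly validate it before acting on it. The sketch below is one way to do so; `parse_verdict` is a hypothetical helper, not a function from the package, and the rule that a rejection must carry feedback is read off the prompt's "empty only if truly approved".

```python
import json


def parse_verdict(reply):
    """Validate a reply shaped like {"approved": bool, "score": 0-100, "feedback": str}.

    Raises ValueError for anything that does not match the prompt's contract.
    """
    try:
        data = json.loads(reply)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict) or set(data) != {"approved", "score", "feedback"}:
        raise ValueError("expected exactly the keys approved, score, feedback")
    if not isinstance(data["approved"], bool):
        raise ValueError("approved must be true or false")
    score = data["score"]
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(score, bool) or not isinstance(score, (int, float)) or not 0 <= score <= 100:
        raise ValueError("score must be a number from 0 to 100")
    if not isinstance(data["feedback"], str):
        raise ValueError("feedback must be a string")
    if not data["approved"] and not data["feedback"].strip():
        raise ValueError("a rejection must carry feedback")
    return data


verdict = parse_verdict(
    '{"approved": false, "score": 35, "feedback": "empty list input raises IndexError"}'
)
print(verdict["approved"], verdict["score"])
```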
@@ -0,0 +1,7 @@
+# Tech lead — keep the project coherent
+
+You are the tech lead. Your job is to keep the project healthy and steer what gets built.
+
+Study the recent changes and outcomes, then rewrite `PLAYBOOK.md` with sharp, concrete
+guidance for the next tasks: conventions to follow, the architecture map, pitfalls already
+hit, and what to build or harden next. Write **only** that one file, then stop.
@@ -0,0 +1,99 @@
+---
+name: leanlab
+description: >-
+  Use when the developer wants a coding task (a feature, endpoint, fix, refactor) done on THIS
+  repo through leanlab's honest, test-gated loop instead of editing files directly. Triggers on
+  requests like "use leanlab to add X", "spec/build this with leanlab", or "have the lab implement
+  X". leanlab writes locked acceptance tests first, then an engineer implements until the gate is
+  green and a reviewer approves, then merges.
+---
+
+# Driving leanlab
+
+leanlab is a CLI already installed in this project. You (Claude Code) orchestrate it: you turn the
+developer's request into a leanlab task, run it, and report the outcome. leanlab enforces honesty —
+it writes acceptance tests, **locks** them, and only merges a change that truly passes them.
+
+## When to use it
+
+- The developer asks to add/implement/fix something **and** wants it test-gated / done "properly".
+- The developer explicitly says "use leanlab" / "spec this" / "build this".
+
+Do NOT use it for: quick edits, questions, or when the developer wants to write the code themselves.
+
+## Preconditions (check first)
+
+- The repo is a **git** repo and a **uv** project (has `pyproject.toml`). If not, tell the developer.
+- `leanlab` is on PATH (`leanlab --help`). `spec` and `build` **spend Claude usage** — for a large
+  task, confirm with the developer before running `build`.
+
+## The flow
+
+1. **Spec the task** (headless — `--yes` auto-approves the drafted tests):
+   ```bash
+   leanlab spec "<clear one-line task>" --yes
+   ```
+   It prints a plain `slug: <slug>` line — capture `<slug>` from it (or `ls .leanlab/worktrees/`).
+   This created **locked** acceptance tests in an isolated worktree.
+
+   **You are the only reviewer of those tests** (`--yes` skipped the human approval). Read the
+   spec + the test files in `.leanlab/worktrees/<slug>/` and sanity-check they actually capture
+   the task. For anything non-trivial or risky, show them to the developer and get a thumbs-up
+   before building. If they're wrong, refine the task wording and re-run `spec` (it overwrites).
+
+2. **Build it** (the engineer implements → gate → reviewer → merge; non-interactive):
+   ```bash
+   leanlab build <slug>
+   ```
+   Exit 0 = merged. The change is now on the main branch.
+
+3. **Report** to the developer: show `git log --oneline -1` (the merge), the new/changed files,
+   and whether it merged. If it did NOT merge, read the build output for why (gate failures or
+   review feedback) and tell the developer; offer to refine the task and re-run.
+
+4. **Clean up** when done: `leanlab clean` (removes merged task worktrees).
+
+## Flags — IMPORTANT (you run without a terminal)
+
+You call leanlab through Bash, which is **not an interactive terminal**. Commands that would
+otherwise stop and ask a human will **hang** unless you pass `--yes`:
+
+- `leanlab spec "<task>" --yes` — **always pass `--yes`.** It auto-approves the drafted acceptance
+  tests. Without it, the command hangs forever waiting for a person.
+- `leanlab init <name> --yes` — (metric labs only) auto-approves the drafted evaluator.
+
+`build`, `gate`, `board`, `check`, `fix`, `clean` need no `--yes` (they don't prompt).
+
+Other flags worth knowing:
+
+| Flag (on `build`) | Effect |
+|-------------------|--------|
+| `--max-attempts N` | cap engineer retries (default 5) |
+| `--min-quality 80` | also require reviewer quality ≥ 80 to merge |
+| `--reviewers 3` | adversarial review panel: 3 reviewers with different lenses (correctness/spec/security/robustness); merges only if all approve. Stricter, costs ~N× review tokens |
+| `--no-playbook` | skip the tech-lead PLAYBOOK update (faster / cheaper) |
+| `--persona-set coding\|metric` | which agent personas (default `coding`) |
+| `--no-isolate` | skip the isolated acceptance re-run (rarely needed) |
+
+`leanlab clean --all` removes ALL task worktrees (default removes only merged ones).
+`leanlab init --for-agent` is the one-time setup that installed THIS skill — **don't** run it during a task.
+
+## Useful commands
+
+| Command | Use |
+|---------|-----|
+| `leanlab spec "<task>" --yes` | create locked acceptance tests for a task |
+| `leanlab build <slug>` | implement to a green gate + review, then merge |
+| `leanlab build <slug> --min-quality 80` | also require a reviewer quality ≥ 80 |
+| `leanlab gate <slug>` | just run the pass/fail gate (free, no agents) |
+| `leanlab check <lab>` / `leanlab fix <lab>` | (metric labs) verify / repair wiring |
+| `leanlab board` | live dashboard of tasks + the PLAYBOOK |
+| `leanlab clean [slug]` | remove task worktrees |
+
+## Rules
+
+- Keep each task **small and concrete** (one endpoint / one fix). Split big asks into several specs.
+- Never hand-edit files inside `.leanlab/worktrees/` — let `build` drive the engineer.
+- The acceptance tests are frozen; leanlab rejects any attempt that edits, deletes, or neuters them.
+- After `build`, the tech-lead updates `.leanlab/PLAYBOOK.md` — the project's growing conventions.
+  You may read it to understand how this repo is built.
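The skill asks the orchestrator to capture `<slug>` from the `slug: <slug>` line that `leanlab spec` prints and then pass it to `leanlab build`. A hedged sketch of that hand-off is below. The helper names are hypothetical, the sample output text is invented (only the `slug:` line format comes from the skill), and the only flags used are ones the skill's table lists.

```python
def extract_slug(spec_output):
    """Return the slug from the first `slug: <slug>` line, or None if there is none."""
    for line in spec_output.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep and key == "slug" and value.strip():
            return value.strip()
    return None


def build_argv(slug, max_attempts=None, min_quality=None, reviewers=None):
    """Assemble the argv for `leanlab build <slug>` with optional documented flags."""
    argv = ["leanlab", "build", slug]
    for flag, value in (
        ("--max-attempts", max_attempts),
        ("--min-quality", min_quality),
        ("--reviewers", reviewers),
    ):
        if value is not None:
            argv += [flag, str(value)]
    return argv


# Invented stand-in for what `leanlab spec "<task>" --yes` might print.
sample_output = "drafting acceptance tests\nslug: add-health-endpoint\nlocked 2 test files\n"
slug = extract_slug(sample_output)
print(build_argv(slug, min_quality=80))
```

Building the command as a list rather than a shell string avoids quoting problems if it is later passed to `subprocess.run`.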
@@ -0,0 +1,273 @@
+Metadata-Version: 2.4
+Name: leanlab
+Version: 0.2.1
+Summary: A self-improving lab for AI agents — evolve ML experiments against a frozen metric, or ship coding tasks through a spec → gate → review → merge loop with locked acceptance tests.
+Project-URL: Homepage, https://github.com/bacharSalleh/leanlab
+Project-URL: Repository, https://github.com/bacharSalleh/leanlab
+Project-URL: Issues, https://github.com/bacharSalleh/leanlab/issues
+Project-URL: Changelog, https://github.com/bacharSalleh/leanlab/blob/main/CHANGELOG.md
+Author-email: Bashar <welcomebachar@gmail.com>
+License: MIT License
+
+        Copyright (c) 2026 Bashar
+
+        Permission is hereby granted, free of charge, to any person obtaining a copy
+        of this software and associated documentation files (the "Software"), to deal
+        in the Software without restriction, including without limitation the rights
+        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+        copies of the Software, and to permit persons to whom the Software is
+        furnished to do so, subject to the following conditions:
+
+        The above copyright notice and this permission notice shall be included in all
+        copies or substantial portions of the Software.
+
+        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+        SOFTWARE.
+License-File: LICENSE
+Keywords: agents,claude,cli,coding-agent,evaluation,experiment,lab,llm,self-improving
+Classifier: Development Status :: 4 - Beta
+Classifier: Environment :: Console
+Classifier: Intended Audience :: Developers
+Classifier: License :: OSI Approved :: MIT License
+Classifier: Operating System :: OS Independent
+Classifier: Programming Language :: Python :: 3
+Classifier: Programming Language :: Python :: 3.11
+Classifier: Programming Language :: Python :: 3.12
+Classifier: Programming Language :: Python :: 3.13
+Classifier: Topic :: Software Development :: Quality Assurance
+Classifier: Topic :: Software Development :: Testing
+Requires-Python: >=3.11
+Requires-Dist: questionary>=2
+Requires-Dist: rich>=13
+Description-Content-Type: text/markdown
+
+# leanlab
+
+[](https://pypi.org/project/leanlab/)
+[](https://github.com/bacharSalleh/leanlab/actions/workflows/ci.yml)
+[](https://pypi.org/project/leanlab/)
+[](LICENSE)
+
+```bash
+pipx install leanlab   # or: pip install leanlab · uvx leanlab
+```
+
+A small **tool for self-improving experiment labs**. A team of agents —
+**Workers** (experimenters), a **Director**, and **HyperCritics** — evolve
+solutions against a **frozen evaluator**, one experiment at a time. The same loop
+drives any task: you just describe the *lab* and Claude builds the scorer.
+
+It is the trading "selflearn" idea, generalized: **strategy → Experiment**,
+**Manager → Director**, `results.csv → results.jsonl`, and the objective (what to
+maximize or minimize) is configuration, not code.
+
+leanlab is used **inside your own project** (like archik): each lab lives in a
+`.leanlab/<name>/` folder; the engine stays in the installed tool.
+
+## Quick start
+
+```bash
+uv tool install --force --editable /path/to/leanlab   # install the `leanlab` tool
+cd ~/my-project && uv init                            # your project (a uv project)
+
+leanlab init iris        # describe the task; Claude drafts the lab
+leanlab check iris       # verify it's wired correctly (free)
+leanlab lock iris        # freeze the scorer
+leanlab run iris --n 5   # the agents evolve experiments (costs Claude)
+leanlab serve iris       # watch the live dashboard
+```
+
+**Full command guide:** [docs/USAGE.md](docs/USAGE.md) — the flow and what each
+command does exactly.
+
+## Anatomy
+
+```
+leanlab/                      # the installable tool (engine — never copied into your project)
+├── cli.py                    # commands: init · check · fix · run · serve · list · lock · unlock
+├── core/
+│   ├── loop.py               # run N experiments, score, log, wake Director/Critic
+│   ├── monitor.py            # live dashboard: stat chips + progress chart + table + stream
+│   ├── init.py               # interactive `init` — Claude drafts task + evaluator
+│   ├── doctor.py             # preflight checks + Claude-powered `fix`
+│   └── agents/               # ports & adapters — the backend-agnostic agent layer
+└── templates/agents/         # CLAUDE.md (Worker) · director.md · critic.md (injected, not copied)
+
+<your project>/.leanlab/<name>/   # a lab — only YOUR files
+├── task.md          goal + experiment contract
+├── lab.json         objective {metric, direction}, commands, cadences
+├── evaluation.py    the FROZEN evaluator → prints ONE line of JSON metrics
+├── validate.py      structural check the Worker runs (no score)
+├── experiments/     where the Worker writes one file per loop
+└── results.jsonl    the book: one JSON record per experiment
+```
+
+**How a lab plugs in:** the engine never imports a lab. It runs the lab's
+`validate_cmd` / `eval_cmd` (from `lab.json`) as subprocesses, reads the **JSON
+metrics** the evaluator prints, and ranks by the configured **objective**. So a lab
+can be ML, trading, graphics, optimization — anything that can print a metric.
+
+## Make your own lab
+
+`leanlab init <name>` is interactive: you describe the task in plain words, Claude
+drafts `task.md` and picks the objective, then proposes an `evaluation.py` you
+approve (or give feedback to revise). It installs the scorer's libraries and
+self-checks the wiring before finishing. Then `leanlab lock <name>` and
+`leanlab run <name>`.
+
+If a lab is mis-wired, `leanlab check` tells you what's wrong and `leanlab fix`
+has Claude repair it.
+
+## The example lab: house-prices
+
+This repo dogfoods itself — `.leanlab/house-prices` predicts California median
+house value (**minimize RMSE**). Each experiment defines `build_estimator()` (any
+scikit-learn-style model); the evaluator fits it on a fixed split and reports
+`rmse / mae / r2 / overfit_gap / train_secs` on held-out data.
+
+## Two lab types — naming map
+
+leanlab runs the same loop two ways. A **metric lab** (ML/optimization — evolve a number)
+and a **coding lab** (do coding tasks on a repo — pass tests). Same engine, different words:
+
+**The team (agents)**
+
+| Metric lab | Coding lab | Job |
+|------------|-----------|-----|
+| Worker (experimenter) | Engineer | makes the attempt |
+| Director (chief scientist) | Tech-lead | steers + maintains the notes |
+| Critic (red-team) | Reviewer | finds what's wrong |
+| *(init drafts the lab)* | Spec-writer | turns a task into locked acceptance tests |
+
+**Core concepts**
+
+| Metric lab | Coding lab |
+|------------|-----------|
+| Experiment (one file in `experiments/`) | Change / diff (in a git worktree) |
+| Frozen evaluator (`evaluation.py` → JSON metric) | Gate (locked acceptance tests + project tests) |
+| Objective metric (min rmse / max acc) | pass/fail gate + quality score (0–100) |
+| Memory (top-N best experiments, injected) | PLAYBOOK (project conventions, injected) |
+| `Director_Notes.md` | `PLAYBOOK.md` |
+| `Critic_Feedback.md` | reviewer feedback (inline, per build) |
+| `results.jsonl` (one row per experiment) | `coding-results.jsonl` + git history |
+| best-so-far (kept by ranking) | merged (kept by passing gate + review) |
+| "lock the evaluator" | "lock the acceptance tests" (+ hash) |
+
+**Commands**
+
+| Metric lab | Coding lab |
+|------------|-----------|
+| `init` (scaffold a lab) | `spec` (define a task) |
+| `run` (evolve experiments) | `build` (engineer a task) |
+| `serve` (dashboard) | `board` (dashboard) |
+| `lock` / `unlock` | (lock is automatic in `spec`) |
+
+**archik nodes**
+
+| Metric lab | Coding lab |
+|------------|-----------|
+| `loop` | `engineer` |
+| `evaluator` | `gate-runner` |
+| `results-store` | `playbook` + `coding-results` |
+| `dashboard` | `coding-board` |
+
+Same idea both ways: **make an attempt → judge it → keep the best → learn for next time** —
+just "experiment + metric + memory" swapped for "code change + tests + playbook."
+
+## The coding lab flow
+
+A coding lab is an **assembly line with quality gates**. Each step hands off to the next, and
+any failed gate sends the work back to the engineer — up to `--max-attempts`. Nothing reaches
+`main` until the tests pass, the work is proven honest, and every reviewer approves.
+
+```
+   Developer
+       │  leanlab spec "task"
+       ▼
+┌──────────────┐
+│ Spec-writer  │  drafts the spec + LOCKS the acceptance tests
+└──────────────┘  (sha256, stored outside the worktree)
+       │  leanlab build <slug>
+       ▼
+┌──────────────┐ ◀──────────────────┐
+│   Engineer   │  implements in an  │
+└──────────────┘  isolated worktree │
+       │                            │
+       ▼                            │
+   [ Gate ]  locked tests pass      │  fail →
+       │                            │  fix & retry
+       ▼                            │  (≤ max-attempts)
+[ Honesty checks ] no tampering,    │
+                   no gamed tests   │
+       │                            │
+       ▼                            │
+[ Reviewer panel ] N lenses,        │
+                   ALL must approve ┘
+       │  all approve
+       ▼
+┌──────────────┐
+│    Merge     │  the change ships to main
+└──────────────┘
+       │
+       ▼
+┌──────────────┐
+│  Tech-lead   │  rewrites PLAYBOOK.md → next task starts smarter
+└──────────────┘
+```
+
+| Step | Who | What happens |
+|------|-----|--------------|
+| `leanlab spec "task"` | **Spec-writer** | Reads the repo, writes a spec + acceptance tests, then **locks** the tests (sha256 stored outside the worktree, so they can't be quietly edited). |
+| `leanlab build <slug>` | **Engineer** | Implements the change in its own git worktree. |
+| Gate | *automated* | Restores the pristine tests and runs them. Fail → back to the engineer with the failure. |
+| Honesty checks | *automated* | (a) Were the locked tests touched? (b) Do they still pass **without** the engineer's own fixtures/conftest? Either trick → rejected. |
+| Reviewer panel | **Reviewer(s)** | 1–N adversarial reviewers, each with a different lens (correctness / spec-conformance / security / robustness). **All must approve**; any blocker returns a concrete counterexample. Size it with `--reviewers N`. |
+| Merge | *automated* | The branch merges into `main` — the change ships. |
+| Playbook | **Tech-lead** | Rewrites `PLAYBOOK.md` so the next task starts with the project's conventions and pitfalls. |
+
+Watch all of it live with `leanlab board`: the four roles, a per-task timeline, the agent chat
+(every session, with token cost), and the growing playbook.
+
+**Why it compounds:** every merged task adds its locked tests to `main` (a ratchet that never
+loosens), and the playbook accumulates — so the lab keeps getting better at *your* project.
+
+## Develop / test
+
+```bash
+uv sync
+uv run pytest         # the test suite
+uv run leanlab list   # run the tool from the checkout, no install
+```
+
+### Board UI (React + Tailwind)
+
+The `leanlab board` dashboard is a React + Tailwind app in [`frontend/`](frontend/), built
+into `leanlab/core/coding/board_dist/` and served by the Python board server. The Python side
+exposes the data as `/api/state`, `/api/task`, and `/api/stream` (SSE); React renders it.
+
+```bash
+cd frontend && npm install && npm run build   # compile the UI (re-run after editing src/)
+```
+
+For live UI work, run `leanlab board --no-open` (API on `:8766`) and `npm run dev` in `frontend/`
+(Vite on `:5173`, proxying `/api`). The compiled `board_dist/` ships inside the wheel.
+
+## Let Claude Code drive it
+
+```bash
+cd ~/my-project && leanlab init --for-agent   # installs .claude/skills/leanlab/SKILL.md
+```
+Then talk to Claude Code — *"use leanlab to add a /health endpoint"* — and it specs, builds, and
+merges through the honest test gate (`spec --yes` / `build` run headless). See `docs/USAGE.md`.
+
+## Notes
+
+- Agents get full tools and are told to be proactive researchers (web, ML, `uv add`).
+- The Worker never runs the evaluator, so scores stay honest; `lock` freezes it.
+- The evaluator (and agent specs) live in the package and are injected into prompts —
+  nothing framework-level is copied into your project.
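The README's "How a lab plugs in" paragraph says the engine runs the lab's `eval_cmd` as a subprocess and reads the JSON metrics it prints. A minimal sketch of that contract is below, under stated assumptions: `run_evaluator` is a hypothetical name, taking the last non-empty stdout line as the metrics line is a guess, and the stand-in evaluator is an inline Python one-liner rather than a real `evaluation.py`. The package's actual logic lives in `leanlab/core/loop.py`, which this diff does not show.

```python
import json
import subprocess
import sys


def run_evaluator(eval_cmd, timeout=60):
    """Run an evaluator command and return the metrics dict it prints.

    Assumption: the metrics are the last non-empty line of stdout, as JSON.
    """
    proc = subprocess.run(eval_cmd, capture_output=True, text=True, timeout=timeout)
    if proc.returncode != 0:
        raise RuntimeError(f"evaluator exited with {proc.returncode}: {proc.stderr.strip()}")
    lines = [ln for ln in proc.stdout.splitlines() if ln.strip()]
    if not lines:
        raise RuntimeError("evaluator printed nothing")
    metrics = json.loads(lines[-1])
    if not isinstance(metrics, dict):
        raise RuntimeError("evaluator output is not a JSON object")
    return metrics


# Stand-in for a lab's eval_cmd: prints one line of JSON metrics.
fake_eval = [
    sys.executable,
    "-c",
    "import json; print(json.dumps({'rmse': 0.52, 'r2': 0.81}))",
]
metrics = run_evaluator(fake_eval)
print(metrics["rmse"])
```

Because the evaluator is only ever a subprocess that prints JSON, the engine needs no knowledge of the lab's libraries, which is what lets the same loop score unrelated kinds of task.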
@@ -0,0 +1,33 @@
+leanlab/__init__.py,sha256=YJh6Z7LDL3h0dRowjKwbezpCQkcdhd894VbEDgK3SmE,56
+leanlab/cli.py,sha256=0_UPRrtI4JU7lQmR_U_m_Ud_W4VZhc6sGUXavJYlkrg,12802
+leanlab/core/__init__.py,sha256=z66ezqJp0ki17xdAE6_4kuQ9vmp4ddUY-hwY2i90Qvc,71
+leanlab/core/doctor.py,sha256=ZyDNOO4JoFa9SE7WoSMBbuEMdZHcbduTFCI-pbw6trQ,9271
+leanlab/core/init.py,sha256=lOqubbRhA2YZIP-m_MFPBzOAR6mQhIdVGJ7ey0dDz_k,10696
+leanlab/core/loop.py,sha256=nWnV4N4PSvnDXOLa5-ahJjcZ7BjpUdu_s0O1CQtxe_8,15245
+leanlab/core/monitor.py,sha256=JvZs-OCnYvEw2yZ69Y-JELPtjIUDjZaVpugJ7NYsZq0,26981
+leanlab/core/agents/__init__.py,sha256=133o98a-kgQUYew7Tz7_KHF3v5EiQxf8PS6_EgTMPFk,354
+leanlab/core/agents/claude.py,sha256=lJyNMOPGbCiFQSoU3bk_d5ytxo3rHvfb3kbxJv2T6zY,1482
+leanlab/core/agents/port.py,sha256=XIIWLIMWvdR9MuR1eln6QKjK7_qA1xaf43AD3qw_7Nc,1880
+leanlab/core/agents/protocol.py,sha256=NfGXAkEtRmR8P9XeiRWaWPzwrkd6aA6KcQPM1Pkm69E,2475
+leanlab/core/coding/__init__.py,sha256=-bomohRLAXqa1_R8bGb-TkLhW7Zf1gCJviPN37MufFg,89
+leanlab/core/coding/board.py,sha256=xqqGrfuaCjcnqO-OkQdxHxbpeIHpqaaLl02TSEaMxqE,12877
+leanlab/core/coding/engineer.py,sha256=NdosqOEteBB5h8ZuVo7hu1WZR2nHBbUQLU5Go3RJmic,13141
+leanlab/core/coding/gate.py,sha256=WQvYxJVYnrZkkWHebBUdoB7VLa3YLkkVKCFxf8mevb0,2175
+leanlab/core/coding/personas.py,sha256=LRJVmAmHoaWXJUwEuX3Se-moDpT6Bbb1FCqWAjyp7E0,912
+leanlab/core/coding/playbook.py,sha256=4yaDkrA0WzU8Dhs1YHVIvk3HGrCVJGt1Z-vgP9cVtYY,1844
+leanlab/core/coding/spec.py,sha256=AVVQMDG_nuLmni55i4euzho5oZ-GfmuuVJE2QhbaPBM,10052
+leanlab/core/coding/board_dist/index.html,sha256=ZKjaVreWnxhYuKUOcKiBpagEkm-l8kbrOtxmfRXa9pA,408
+leanlab/core/coding/board_dist/assets/index-BBCkNArL.css,sha256=OLp59qr2tCqgNvZTPAuAkTQGp3h03FDwsZzWTLlWpzA,12940
+leanlab/core/coding/board_dist/assets/index-CNGMDAuO.js,sha256=_Lt4TAQBhXEH92jiI3BjlgayddVZ8VVqXFhNXtsYjTk,158714
+leanlab/templates/agents/CLAUDE.md,sha256=WDX0vu5_I6801k3R-_ibpjjOed_I_OTz8eJ1Qaw9_T8,2223
+leanlab/templates/agents/critic.md,sha256=qm6jv7G_vo1YqWfOR1Q8RBvrmqGWznq28Lb1R8YGKFk,1747
+leanlab/templates/agents/director.md,sha256=NOtK_wvnB486h_0AylKlcxMv_6m8je71CiTEELmN2t0,1737
+leanlab/templates/agents/engineer.md,sha256=gaiE6WynIh7P-sLKN8Q1QiVnx7JWJOHuQRRsEip72wk,673
+leanlab/templates/agents/reviewer.md,sha256=OEZOrZFfxQKNOwQrku-3kfqbWSqUiszHbSPR53g4xw0,2180
+leanlab/templates/agents/techlead.md,sha256=8ejAivnppTlQwNeMdgVoQoLp4fA-HyAC9dsiU4YThGI,392
+leanlab/templates/skill/SKILL.md,sha256=H1wiSSYahQFTguZKbIWf1P8cfi5fKBO0rcpxMYdLZ94,5052
+leanlab-0.2.1.dist-info/METADATA,sha256=dq5aZEIG_yRTLXO4hXeb0R_rgpQkFe2NrGnLmid-YNE,13466
+leanlab-0.2.1.dist-info/WHEEL,sha256=mffPy8wBnZQn2VnJUU5jE99KsxaSfiyMHV9Yt0aLVxs,87
+leanlab-0.2.1.dist-info/entry_points.txt,sha256=UOqFyPpcnR095f9g4laknML2JJQDUEz7woHnpybZ0zs,45
+leanlab-0.2.1.dist-info/licenses/LICENSE,sha256=vA4qJcb4fZyh5E05ruWo4wljvgSQGBgwNlPklZ9OQ3w,1063
+leanlab-0.2.1.dist-info/RECORD,,
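Each RECORD row above has the form `path,sha256=<digest>,<size>`, where the digest is the URL-safe base64 encoding of the file's SHA-256 hash with the trailing `=` padding removed, and the row for RECORD itself leaves both fields empty. The sketch below shows how one row could be checked against file bytes. The helper names and the sample payload are illustrative; nothing here reads the actual wheel.

```python
import base64
import hashlib


def record_hash(data):
    """Return the RECORD-style hash field for some file bytes."""
    digest = hashlib.sha256(data).digest()
    return "sha256=" + base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


def check_entry(record_line, data):
    """Check one RECORD row (path,hash,size) against the file's bytes."""
    # Split from the right so a comma inside the path cannot break parsing.
    path, hash_field, size_field = record_line.rsplit(",", 2)
    if not hash_field and not size_field:
        return True  # RECORD lists itself without a hash or size
    return hash_field == record_hash(data) and int(size_field) == len(data)


payload = b"hello wheel\n"  # illustrative content, not a file from this package
line = f"pkg/hello.txt,{record_hash(payload)},{len(payload)}"
print(check_entry(line, payload), check_entry(line, payload + b"x"))
```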
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2026 Bashar
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.