skillrl-1.0.0.tar.gz
- skillrl-1.0.0/LICENSE +21 -0
- skillrl-1.0.0/MANIFEST.in +10 -0
- skillrl-1.0.0/PKG-INFO +362 -0
- skillrl-1.0.0/README.md +306 -0
- skillrl-1.0.0/pyproject.toml +52 -0
- skillrl-1.0.0/setup.cfg +4 -0
- skillrl-1.0.0/skillrl/__init__.py +49 -0
- skillrl-1.0.0/skillrl/config.py +148 -0
- skillrl-1.0.0/skillrl/core/__init__.py +36 -0
- skillrl-1.0.0/skillrl/core/editor.py +110 -0
- skillrl-1.0.0/skillrl/core/gate.py +94 -0
- skillrl-1.0.0/skillrl/core/scheduler.py +88 -0
- skillrl-1.0.0/skillrl/core/utils.py +96 -0
- skillrl-1.0.0/skillrl/envs/__init__.py +19 -0
- skillrl-1.0.0/skillrl/envs/base.py +52 -0
- skillrl-1.0.0/skillrl/envs/qa.py +163 -0
- skillrl-1.0.0/skillrl/llm/__init__.py +19 -0
- skillrl-1.0.0/skillrl/llm/base.py +56 -0
- skillrl-1.0.0/skillrl/llm/openai_client.py +163 -0
- skillrl-1.0.0/skillrl/pipeline/__init__.py +22 -0
- skillrl-1.0.0/skillrl/pipeline/aggregate.py +220 -0
- skillrl-1.0.0/skillrl/pipeline/reflect.py +253 -0
- skillrl-1.0.0/skillrl/pipeline/rollout.py +93 -0
- skillrl-1.0.0/skillrl/pipeline/select.py +110 -0
- skillrl-1.0.0/skillrl/prompts/__init__.py +53 -0
- skillrl-1.0.0/skillrl/prompts/analyst_error.md +35 -0
- skillrl-1.0.0/skillrl/prompts/analyst_success.md +31 -0
- skillrl-1.0.0/skillrl/prompts/merge_failure.md +22 -0
- skillrl-1.0.0/skillrl/prompts/merge_final.md +23 -0
- skillrl-1.0.0/skillrl/prompts/merge_success.md +19 -0
- skillrl-1.0.0/skillrl/prompts/ranking.md +23 -0
- skillrl-1.0.0/skillrl/py.typed +0 -0
- skillrl-1.0.0/skillrl/trainer.py +714 -0
- skillrl-1.0.0/skillrl/types.py +241 -0
- skillrl-1.0.0/skillrl.egg-info/PKG-INFO +362 -0
- skillrl-1.0.0/skillrl.egg-info/SOURCES.txt +37 -0
- skillrl-1.0.0/skillrl.egg-info/dependency_links.txt +1 -0
- skillrl-1.0.0/skillrl.egg-info/requires.txt +10 -0
- skillrl-1.0.0/skillrl.egg-info/top_level.txt +1 -0
skillrl-1.0.0/LICENSE
ADDED
|
@@ -0,0 +1,21 @@
|
|
|
1
|
+
MIT License
|
|
2
|
+
|
|
3
|
+
Copyright (c) 2026 skillrl contributors
|
|
4
|
+
|
|
5
|
+
Permission is hereby granted, free of charge, to any person obtaining a copy
|
|
6
|
+
of this software and associated documentation files (the "Software"), to deal
|
|
7
|
+
in the Software without restriction, including without limitation the rights
|
|
8
|
+
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
|
|
9
|
+
copies of the Software, and to permit persons to whom the Software is
|
|
10
|
+
furnished to do so, subject to the following conditions:
|
|
11
|
+
|
|
12
|
+
The above copyright notice and this permission notice shall be included in all
|
|
13
|
+
copies or substantial portions of the Software.
|
|
14
|
+
|
|
15
|
+
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
|
|
16
|
+
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
|
|
17
|
+
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
|
|
18
|
+
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
|
|
19
|
+
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
|
|
20
|
+
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
|
|
21
|
+
SOFTWARE.
|
skillrl-1.0.0/MANIFEST.in
ADDED
|
@@ -0,0 +1,10 @@
|
|
|
1
|
+
include README.md
|
|
2
|
+
include LICENSE
|
|
3
|
+
include pyproject.toml
|
|
4
|
+
recursive-include skillrl/prompts *.md
|
|
5
|
+
recursive-include skillrl/envs/data *.json
|
|
6
|
+
recursive-exclude tests *
|
|
7
|
+
recursive-exclude examples *
|
|
8
|
+
global-exclude __pycache__
|
|
9
|
+
global-exclude *.py[cod]
|
|
10
|
+
global-exclude .DS_Store
|
skillrl-1.0.0/PKG-INFO
ADDED
|
@@ -0,0 +1,362 @@
|
|
|
1
|
+
Metadata-Version: 2.4
|
|
2
|
+
Name: skillrl
|
|
3
|
+
Version: 1.0.0
|
|
4
|
+
Summary: A TRL-like training library for end-to-end skill optimization of frozen LLM agents (based on Microsoft SkillOpt).
|
|
5
|
+
Author: skillrl contributors
|
|
6
|
+
License: MIT License
|
|
7
|
+
|
|
8
|
+
Copyright (c) 2026 skillrl contributors
|
|
9
|
+
|
|
10
|
+
Permission is hereby granted, free of charge, to any person obtaining a copy
|
|
11
|
+
of this software and associated documentation files (the "Software"), to deal
|
|
12
|
+
in the Software without restriction, including without limitation the rights
|
|
13
|
+
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
|
|
14
|
+
copies of the Software, and to permit persons to whom the Software is
|
|
15
|
+
furnished to do so, subject to the following conditions:
|
|
16
|
+
|
|
17
|
+
The above copyright notice and this permission notice shall be included in all
|
|
18
|
+
copies or substantial portions of the Software.
|
|
19
|
+
|
|
20
|
+
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
|
|
21
|
+
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
|
|
22
|
+
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
|
|
23
|
+
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
|
|
24
|
+
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
|
|
25
|
+
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
|
|
26
|
+
SOFTWARE.
|
|
27
|
+
|
|
28
|
+
Project-URL: Homepage, https://github.com/Xia12121/SkillRL
|
|
29
|
+
Project-URL: Repository, https://github.com/Xia12121/SkillRL
|
|
30
|
+
Project-URL: Issues, https://github.com/Xia12121/SkillRL/issues
|
|
31
|
+
Project-URL: Reference Repo, https://github.com/microsoft/SkillOpt
|
|
32
|
+
Keywords: llm,agent,skill,prompt-optimization,text-space-optimization,skillopt,trl
|
|
33
|
+
Classifier: Development Status :: 4 - Beta
|
|
34
|
+
Classifier: Programming Language :: Python :: 3
|
|
35
|
+
Classifier: Programming Language :: Python :: 3.10
|
|
36
|
+
Classifier: Programming Language :: Python :: 3.11
|
|
37
|
+
Classifier: Programming Language :: Python :: 3.12
|
|
38
|
+
Classifier: License :: OSI Approved :: MIT License
|
|
39
|
+
Classifier: Operating System :: OS Independent
|
|
40
|
+
Classifier: Intended Audience :: Science/Research
|
|
41
|
+
Classifier: Intended Audience :: Developers
|
|
42
|
+
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
|
|
43
|
+
Requires-Python: >=3.10
|
|
44
|
+
Description-Content-Type: text/markdown
|
|
45
|
+
License-File: LICENSE
|
|
46
|
+
Requires-Dist: openai>=1.40.0
|
|
47
|
+
Requires-Dist: pyyaml>=6.0
|
|
48
|
+
Requires-Dist: tqdm>=4.65.0
|
|
49
|
+
Provides-Extra: dev
|
|
50
|
+
Requires-Dist: pytest>=7.0; extra == "dev"
|
|
51
|
+
Requires-Dist: pytest-mock>=3.10; extra == "dev"
|
|
52
|
+
Requires-Dist: ruff>=0.5; extra == "dev"
|
|
53
|
+
Requires-Dist: build>=1.2; extra == "dev"
|
|
54
|
+
Requires-Dist: twine>=5.0; extra == "dev"
|
|
55
|
+
Dynamic: license-file
|
|
56
|
+
|
|
57
|
+
# skillrl
|
|
58
|
+
|
|
59
|
+
> **A TRL-like training library for end-to-end skill optimization of frozen LLM agents.**
|
|
60
|
+
> Implements the core algorithm of Microsoft **SkillOpt** ([project page](https://microsoft.github.io/SkillOpt/), [repo](https://github.com/microsoft/SkillOpt)) as a clean, modular Python package — designed to grow into the *TRL of skill / prompt-space optimization*.
|
|
61
|
+
|
|
62
|
+
---
|
|
63
|
+
|
|
64
|
+
## 1. What is this?
|
|
65
|
+
|
|
66
|
+
Modern LLM agents are usually improved either by **fine-tuning weights** (expensive, opaque) or by **hand-tweaking prompts** (cheap, brittle, ad-hoc).
|
|
67
|
+
|
|
68
|
+
**SkillOpt** proposes a third path: treat a **natural-language *skill document*** (a markdown file of guidelines, heuristics, do/don'ts) as the *trainable state*. Both the **target LLM** (the agent that uses the skill) and the **optimizer LLM** (the model that critiques and rewrites the skill) stay **frozen**. Gradient descent is replaced by a textual analogue:
|
|
69
|
+
|
|
70
|
+
| SGD on weights | SkillOpt on skill text |
|
|
71
|
+
|---------------------------------|-------------------------------------------------|
|
|
72
|
+
| Forward pass | **Rollout**: target agent runs the skill on a batch |
|
|
73
|
+
| Backward pass (∂L/∂θ) | **Reflect**: optimizer LLM analyses success/failure → candidate edits |
|
|
74
|
+
| Gradient accumulation | **Aggregate**: hierarchical merge → one coherent patch |
|
|
75
|
+
| Gradient clipping / `learning_rate` | **Select**: rank edits, keep top-`L` (the *edit budget*) |
|
|
76
|
+
| `optimizer.step()` | **Update**: deterministically apply edits to the skill doc |
|
|
77
|
+
| Validation | **Evaluate**: hold-out gate — accept iff strictly better |
|
|
78
|
+
|
|
79
|
+
**`skillrl` packages this 6-stage pipeline as a TRL-style library**, so you can write:
|
|
80
|
+
|
|
81
|
+
```python
|
|
82
|
+
trainer = SkillOptTrainer(config=cfg, env=env, optimizer_client=..., target_client=...)
|
|
83
|
+
summary = trainer.train()
|
|
84
|
+
```
|
|
85
|
+
|
|
86
|
+
…just like you'd write `PPOTrainer(...).train()` in 🤗 TRL.
|
|
87
|
+
|
|
88
|
+
---
|
|
89
|
+
|
|
90
|
+
## 2. Why TRL-like?
|
|
91
|
+
|
|
92
|
+
| 🤗 TRL | skillrl |
|
|
93
|
+
|-----------------------------|------------------------------------------|
|
|
94
|
+
| `PPOConfig` (dataclass) | `SkillOptConfig` (dataclass) |
|
|
95
|
+
| `PPOTrainer.train()` | `SkillOptTrainer.train()` |
|
|
96
|
+
| Reward model | `SkillEnv` (`rollout_one` returns `hard`/`soft`/`fail_reason`) |
|
|
97
|
+
| Policy model (trainable) | **Skill document** (markdown, trainable) |
|
|
98
|
+
| Reference / value model | Frozen **target_client** |
|
|
99
|
+
| Optimizer (Adam) | Frozen **optimizer_client** + edit budget scheduler |
|
|
100
|
+
| Learning rate | `edit_budget` (max edits per step) + `lr_scheduler` (constant/linear/cosine) |
|
|
101
|
+
| Gradient clipping | LLM-based ranking — keeps top-`L` edits |
|
|
102
|
+
| Validation reward | `selection_split` gate (`hard` / `soft` / `mixed`) |
|
|
103
|
+
|
|
104
|
+
---
|
|
105
|
+
|
|
106
|
+
## 3. Installation
|
|
107
|
+
|
|
108
|
+
```bash
|
|
109
|
+
# editable install from this repo
|
|
110
|
+
pip install -e .
|
|
111
|
+
|
|
112
|
+
# or with dev extras (pytest, pytest-mock, ruff, build, twine)
|
|
113
|
+
pip install -e ".[dev]"
|
|
114
|
+
```
|
|
115
|
+
|
|
116
|
+
Requirements: Python ≥ 3.10, `openai>=1.40.0`, `pyyaml>=6.0`, `tqdm>=4.65.0`. Any OpenAI-compatible endpoint (vLLM / Together / Azure / Moonshot / DeepSeek / …) works out of the box.
|
|
117
|
+
|
|
118
|
+
---
|
|
119
|
+
|
|
120
|
+
## 4. Quick start
|
|
121
|
+
|
|
122
|
+
A minimal end-to-end example using the bundled `SimpleQAEnv`:
|
|
123
|
+
|
|
124
|
+
```python
|
|
125
|
+
from skillrl import SkillOptConfig, SkillOptTrainer
|
|
126
|
+
from skillrl.envs.qa import SimpleQAEnv
|
|
127
|
+
from skillrl.llm.openai_client import OpenAIChatClient
|
|
128
|
+
|
|
129
|
+
# 1. Data
|
|
130
|
+
train = [
|
|
131
|
+
{"id": "1", "question": "Capital of France?", "answers": ["Paris"]},
|
|
132
|
+
{"id": "2", "question": "Largest ocean on Earth?", "answers": ["Pacific Ocean", "Pacific"]},
|
|
133
|
+
# ... 30+ items recommended
|
|
134
|
+
]
|
|
135
|
+
val = [{"id": "v1", "question": "Capital of Japan?", "answers": ["Tokyo"]}]
|
|
136
|
+
test = [{"id": "t1", "question": "Capital of Italy?", "answers": ["Rome"]}]
|
|
137
|
+
|
|
138
|
+
env = SimpleQAEnv(train_items=train, val_items=val, test_items=test)
|
|
139
|
+
|
|
140
|
+
# 2. Backends (optimizer_client = strong; target_client = the agent under training)
|
|
141
|
+
optimizer = OpenAIChatClient(model="gpt-4o") # critic / rewriter
|
|
142
|
+
target = OpenAIChatClient(model="gpt-4o-mini") # the frozen agent
|
|
143
|
+
|
|
144
|
+
# 3. Config (paper-default protocol)
|
|
145
|
+
cfg = SkillOptConfig(
|
|
146
|
+
num_epochs=2,
|
|
147
|
+
batch_size=8,
|
|
148
|
+
minibatch_size=4,
|
|
149
|
+
edit_budget=4,
|
|
150
|
+
lr_scheduler="cosine",
|
|
151
|
+
gate_metric="hard",
|
|
152
|
+
out_root="outputs/qa_demo",
|
|
153
|
+
)
|
|
154
|
+
|
|
155
|
+
# 4. Train
|
|
156
|
+
trainer = SkillOptTrainer(
|
|
157
|
+
config=cfg, env=env,
|
|
158
|
+
optimizer_client=optimizer, target_client=target,
|
|
159
|
+
initial_skill="You are a concise QA assistant. Answer in one short phrase.",
|
|
160
|
+
)
|
|
161
|
+
summary = trainer.train()
|
|
162
|
+
print(summary["best_selection_score"], summary["test_hard"])
|
|
163
|
+
```
|
|
164
|
+
|
|
165
|
+
A runnable version lives at [`examples/train_qa.py`](examples/train_qa.py) in the source repository; the `examples/` directory is excluded from the packaged distribution.
|
|
166
|
+
|
|
167
|
+
After training, the output directory contains everything you need to inspect the run:
|
|
168
|
+
|
|
169
|
+
```
|
|
170
|
+
outputs/qa_demo/
|
|
171
|
+
├── config.json # resolved config
|
|
172
|
+
├── best_skill.md # all-time best skill (deploy this)
|
|
173
|
+
├── current_skill.md # last accepted skill
|
|
174
|
+
├── history.json # per-step records
|
|
175
|
+
├── runtime_state.json # for auto-resume
|
|
176
|
+
├── summary.json # final report
|
|
177
|
+
├── skills/skill_v0001.md ... # per-step snapshots
|
|
178
|
+
├── steps/step_0000/
|
|
179
|
+
│ ├── rollout_results.json
|
|
180
|
+
│ ├── raw_patches.json
|
|
181
|
+
│ ├── merged_patch.json
|
|
182
|
+
│ ├── ranked_patch.json
|
|
183
|
+
│ ├── candidate_skill.md
|
|
184
|
+
│ ├── edit_apply_report.json
|
|
185
|
+
│ ├── selection_eval/ # validation rollouts on this candidate
|
|
186
|
+
│ └── step_record.json
|
|
187
|
+
├── test_eval_baseline/
|
|
188
|
+
└── test_eval_best/
|
|
189
|
+
```
|
|
190
|
+
|
|
191
|
+
If you re-launch with the same `out_root`, training **auto-resumes** from the last completed step.
|
|
192
|
+
|
|
193
|
+
---
|
|
194
|
+
|
|
195
|
+
## 5. The 6-stage pipeline
|
|
196
|
+
|
|
197
|
+
```
|
|
198
|
+
┌──────────────────────────────────────────────────────────────┐
|
|
199
|
+
│ one optimization step │
|
|
200
|
+
│ │
|
|
201
|
+
│ current_skill.md │
|
|
202
|
+
│ │ │
|
|
203
|
+
│ ① ROLLOUT env.rollout_one(item, skill, target_client) │
|
|
204
|
+
│ │ (parallel, n=batch_size) │
|
|
205
|
+
│ ▼ │
|
|
206
|
+
│ trajectories → hard / soft / fail_reason │
|
|
207
|
+
│ │ │
|
|
208
|
+
│ ② REFLECT analyse failure & success minibatches │
|
|
209
|
+
│ │ optimizer_client → JSON {reasoning, edits} │
|
|
210
|
+
│ ▼ │
|
|
211
|
+
│ raw_patches (failure-tagged + success-tagged) │
|
|
212
|
+
│ │ │
|
|
213
|
+
│ ③ AGGREGATE hierarchical merge, failure-first │
|
|
214
|
+
│ │ optimizer_client → one coherent patch │
|
|
215
|
+
│ ▼ │
|
|
216
|
+
│ merged_patch │
|
|
217
|
+
│ │ │
|
|
218
|
+
│ ④ SELECT LLM ranks edits, keep top-L (edit_budget) │
|
|
219
|
+
│ │ ≈ "gradient clipping" in text space │
|
|
220
|
+
│ ▼ │
|
|
221
|
+
│ ranked_patch │
|
|
222
|
+
│ │ │
|
|
223
|
+
│ ⑤ UPDATE apply_patch(skill, ranked_patch) │
|
|
224
|
+
│ │ deterministic, append/insert/replace/delete │
|
|
225
|
+
│ ▼ │
|
|
226
|
+
│ candidate_skill.md │
|
|
227
|
+
│ │ │
|
|
228
|
+
│ ⑥ EVALUATE rollout on selection_split → gate │
|
|
229
|
+
│ │ accept iff strictly better than current_score │
|
|
230
|
+
│ ▼ │
|
|
231
|
+
│ if accept: current_skill := candidate │
|
|
232
|
+
│ if also > best_score: best_skill := candidate │
|
|
233
|
+
│ else: keep current_skill │
|
|
234
|
+
│ │
|
|
235
|
+
└──────────────────────────────────────────────────────────────┘
|
|
236
|
+
```
|
|
237
|
+
|
|
238
|
+
**Edit budget = textual learning rate.** The cap on edits applied per step follows a schedule over the entire training horizon (held constant, or decayed linearly or along a cosine curve), as in the SkillOpt paper.
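To make the schedule concrete, here is a minimal sketch of how a per-step edit cap could be computed. The function name `edit_budget_at`, its parameters, and the `min_budget` floor are assumptions for illustration; the actual logic in `skillrl/core/scheduler.py` may differ.

```python
import math


def edit_budget_at(step: int, total_steps: int, base_budget: int,
                   kind: str = "cosine", min_budget: int = 1) -> int:
    """Return the edit cap for a 0-indexed step (hypothetical helper)."""
    if total_steps <= 1 or kind == "constant":
        return base_budget
    # Fraction of the training horizon already consumed, clamped to [0, 1].
    progress = min(max(step / (total_steps - 1), 0.0), 1.0)
    if kind == "linear":
        scale = 1.0 - progress
    elif kind == "cosine":
        scale = 0.5 * (1.0 + math.cos(math.pi * progress))
    else:
        raise ValueError(f"unknown schedule: {kind!r}")
    # Interpolate between the floor and the base budget, then round.
    value = min_budget + (base_budget - min_budget) * scale
    return max(min_budget, round(value))


# Budget per step for a 5-step run starting from edit_budget=4.
print([edit_budget_at(s, 5, 4, "linear") for s in range(5)])
```

The floor of one edit is a design choice of this sketch: it keeps late steps from becoming no-ops when the schedule approaches zero.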
|
|
239
|
+
|
|
240
|
+
**Validation gate** is strict: candidates must *strictly* beat `current_score`. A separate `best_skill` is tracked in parallel, so the artifact you ship is always the all-time best.
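The accept rule can be sketched in a few lines. `GateState` and `gate_step` are hypothetical names used only for this illustration, and the sketch assumes a single scalar score per skill; it is not the code in `skillrl/core/gate.py`.

```python
from dataclasses import dataclass


@dataclass
class GateState:
    """Scores and texts tracked across steps (hypothetical structure)."""
    current_skill: str
    current_score: float
    best_skill: str
    best_score: float


def gate_step(state: GateState, candidate: str, candidate_score: float) -> bool:
    """Accept the candidate only if it strictly beats the current score."""
    if candidate_score <= state.current_score:
        return False  # ties are rejected: the gate is strict
    state.current_skill, state.current_score = candidate, candidate_score
    if candidate_score > state.best_score:
        state.best_skill, state.best_score = candidate, candidate_score
    return True


state = GateState("v0", 0.50, "v0", 0.50)
print(gate_step(state, "v1", 0.50))  # tie with the current score
print(gate_step(state, "v2", 0.62))  # strict improvement
```

Rejecting ties means a candidate that merely matches the current score never replaces it, which avoids churn from edits that do not measurably help.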
|
|
241
|
+
|
|
242
|
+
---
|
|
243
|
+
|
|
244
|
+
## 6. Library structure
|
|
245
|
+
|
|
246
|
+
```
|
|
247
|
+
skillrl/
|
|
248
|
+
├── __init__.py # public exports
|
|
249
|
+
├── config.py # SkillOptConfig (dataclass)
|
|
250
|
+
├── types.py # Edit / Patch / RawPatch / RolloutResult / GateResult
|
|
251
|
+
├── trainer.py # SkillOptTrainer — the main loop
|
|
252
|
+
│
|
|
253
|
+
├── core/
|
|
254
|
+
│ ├── editor.py # apply_edit / apply_patch (5-Update)
|
|
255
|
+
│ ├── scheduler.py # constant / linear / cosine edit-budget schedulers
|
|
256
|
+
│ ├── gate.py # validation gate (hard / soft / mixed)
|
|
257
|
+
│ └── utils.py # extract_json, compute_score, skill_hash
|
|
258
|
+
│
|
|
259
|
+
├── llm/
|
|
260
|
+
│ ├── base.py # BaseLLMClient interface
|
|
261
|
+
│ └── openai_client.py # OpenAI / Azure / OpenAI-compatible
|
|
262
|
+
│
|
|
263
|
+
├── pipeline/
|
|
264
|
+
│ ├── rollout.py # 1-Rollout (parallel)
|
|
265
|
+
│ ├── reflect.py # 2-Reflect (failure / success minibatches)
|
|
266
|
+
│ ├── aggregate.py # 3-Aggregate (hierarchical merge, failure-first)
|
|
267
|
+
│ └── select.py # 4-Select (LLM rank + top-L clip)
|
|
268
|
+
│
|
|
269
|
+
├── prompts/ # bundled markdown prompt templates
|
|
270
|
+
│ ├── analyst_error.md
|
|
271
|
+
│ ├── analyst_success.md
|
|
272
|
+
│ ├── merge_failure.md
|
|
273
|
+
│ ├── merge_success.md
|
|
274
|
+
│ ├── merge_final.md
|
|
275
|
+
│ └── ranking.md
|
|
276
|
+
│
|
|
277
|
+
└── envs/
|
|
278
|
+
├── base.py # SkillEnv abstract class
|
|
279
|
+
└── qa.py # SimpleQAEnv (reference implementation)
|
|
280
|
+
```
|
|
281
|
+
|
|
282
|
+
---
|
|
283
|
+
|
|
284
|
+
## 7. Writing your own environment
|
|
285
|
+
|
|
286
|
+
To train a skill on your task, subclass `SkillEnv`:
|
|
287
|
+
|
|
288
|
+
```python
|
|
289
|
+
from skillrl.envs.base import SkillEnv
|
|
290
|
+
from skillrl.types import RolloutResult
|
|
291
|
+
|
|
292
|
+
class MyEnv(SkillEnv):
|
|
293
|
+
name = "my_env"
|
|
294
|
+
|
|
295
|
+
def get_initial_skill(self) -> str:
|
|
296
|
+
return "You are an expert XYZ agent..."
|
|
297
|
+
|
|
298
|
+
def get_items(self, split: str) -> list[dict]:
|
|
299
|
+
return self._splits[split] # train / val / test
|
|
300
|
+
|
|
301
|
+
def rollout_one(self, *, item, skill, target_client) -> RolloutResult:
|
|
302
|
+
# 1) build the conversation; the *skill* is typically the system prompt.
|
|
303
|
+
# 2) call target_client.chat(...) one or more times (multi-turn allowed).
|
|
304
|
+
# 3) score the outcome: hard ∈ {0,1}, soft ∈ [0,1].
|
|
305
|
+
# 4) return RolloutResult(...).
|
|
306
|
+
...
|
|
307
|
+
```
|
|
308
|
+
|
|
309
|
+
That's it — drop it into `SkillOptTrainer` and you're training.
|
|
310
|
+
|
|
311
|
+
> **Tip.** For multi-turn / tool-using agents, return the full `conversation` list and a meaningful `fail_reason`. The Reflect stage uses both to localise *why* the skill failed and *what* to change.
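As an illustration of step 3 in the stub above, the following sketch scores a QA prediction with a `hard` exact-match flag and a `soft` token-level F1. The helper names `normalize` and `score_answer` are assumptions, the normalisation is ASCII-only for brevity, and this is not necessarily how `SimpleQAEnv` scores answers.

```python
import re
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, replace punctuation with spaces, split into tokens."""
    return re.sub(r"[^a-z0-9 ]", " ", text.lower()).split()


def score_answer(prediction: str, answers: list[str]) -> tuple[int, float]:
    """Return (hard, soft): exact match in {0, 1} and best token F1 in [0, 1]."""
    pred = normalize(prediction)
    hard, soft = 0, 0.0
    for answer in answers:
        gold = normalize(answer)
        if pred and pred == gold:
            hard = 1
        # Multiset intersection counts tokens shared by prediction and gold.
        overlap = sum((Counter(pred) & Counter(gold)).values())
        if overlap:
            precision = overlap / len(pred)
            recall = overlap / len(gold)
            soft = max(soft, 2 * precision * recall / (precision + recall))
    return hard, soft


print(score_answer("Paris.", ["Paris"]))
print(score_answer("The Pacific", ["Pacific Ocean", "Pacific"]))
```

A `rollout_one` implementation could call a helper like this on the final model reply and use a short description of the mismatch as `fail_reason`.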
|
|
312
|
+
|
|
313
|
+
---
|
|
314
|
+
|
|
315
|
+
## 8. Customising prompts
|
|
316
|
+
|
|
317
|
+
All optimizer-LLM prompts live in `skillrl/prompts/*.md`. Override any of them per-trainer without modifying the package:
|
|
318
|
+
|
|
319
|
+
```python
|
|
320
|
+
trainer = SkillOptTrainer(
|
|
321
|
+
config=cfg, env=env,
|
|
322
|
+
optimizer_client=opt, target_client=tgt,
|
|
323
|
+
prompt_overrides={
|
|
324
|
+
"analyst_error": open("my_prompts/analyst_error.md").read(),
|
|
325
|
+
"ranking": open("my_prompts/ranking.md").read(),
|
|
326
|
+
},
|
|
327
|
+
)
|
|
328
|
+
```
|
|
329
|
+
|
|
330
|
+
Available keys: `analyst_error`, `analyst_success`, `merge_failure`, `merge_success`, `merge_final`, `ranking`.
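One way such overrides could be merged with the bundled defaults is sketched below. `resolve_prompts` and the `defaults` mapping are hypothetical; only the six key names come from the list above, and the trainer's real merging and validation behaviour is not shown in this document.

```python
from typing import Dict, Optional

PROMPT_KEYS = (
    "analyst_error", "analyst_success",
    "merge_failure", "merge_success", "merge_final",
    "ranking",
)


def resolve_prompts(defaults: Dict[str, str],
                    overrides: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Merge user overrides over defaults, rejecting unknown keys."""
    overrides = overrides or {}
    unknown = sorted(set(overrides) - set(PROMPT_KEYS))
    if unknown:
        # Failing early catches typos such as "rankings".
        raise KeyError(f"unknown prompt keys: {unknown}")
    return {key: overrides.get(key, defaults[key]) for key in PROMPT_KEYS}


defaults = {key: f"<default {key} template>" for key in PROMPT_KEYS}
prompts = resolve_prompts(defaults, {"ranking": "Rank the edits by impact."})
print(prompts["ranking"])
```

Rejecting unknown keys is a choice made for this sketch; silently ignoring them would let a misspelled override go unnoticed.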
|
|
331
|
+
|
|
332
|
+
---
|
|
333
|
+
|
|
334
|
+
## 9. Reproducibility & observability
|
|
335
|
+
|
|
336
|
+
* **Determinism.** The same `seed`, `batch_size`, `minibatch_size`, dataset and backends produce the same minibatch shuffles and analyst groupings.
|
|
337
|
+
* **Auto-resume.** Re-running with the same `out_root` skips already-completed steps (rebuilds the selection cache from `history.json`).
|
|
338
|
+
* **Per-step artifacts.** Every stage's input/output is dumped — easy to diff between steps and reproduce any single step locally.
|
|
339
|
+
* **Selection cache.** Identical candidate skills (by `skill_hash`) reuse cached selection-split scores — saves a *lot* of money on long runs.
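The selection cache idea can be sketched as a content-addressed memo table. `skill_fingerprint` and `SelectionCache` are hypothetical names, and the use of truncated SHA-256 is an assumption; the package's own `skill_hash` may be implemented differently.

```python
import hashlib
from typing import Callable, Dict


def skill_fingerprint(skill_text: str) -> str:
    """Stable content hash of a skill document (assumed: SHA-256, truncated)."""
    return hashlib.sha256(skill_text.encode("utf-8")).hexdigest()[:16]


class SelectionCache:
    """Memoise selection-split scores by skill content."""

    def __init__(self, evaluate: Callable[[str], float]) -> None:
        self._evaluate = evaluate
        self._scores: Dict[str, float] = {}
        self.misses = 0  # number of real evaluations performed

    def score(self, skill_text: str) -> float:
        key = skill_fingerprint(skill_text)
        if key not in self._scores:
            self.misses += 1
            self._scores[key] = self._evaluate(skill_text)
        return self._scores[key]


# Stand-in evaluator; a real one would run validation rollouts.
cache = SelectionCache(lambda text: len(text) / 100)
cache.score("Answer in one short phrase.")
cache.score("Answer in one short phrase.")  # identical text: served from cache
print(cache.misses)
```

Keying on the text rather than on a step index means a candidate that reverts to an earlier skill is never re-evaluated.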
|
|
340
|
+
|
|
341
|
+
---
|
|
342
|
+
|
|
343
|
+
## 10. What's NOT in 1.0 (yet)
|
|
344
|
+
|
|
345
|
+
`skillrl 1.0` ships the **core algorithm** as faithfully as possible. The following SkillOpt features are intentionally deferred to future minor releases:
|
|
346
|
+
|
|
347
|
+
* `slow_update` (skill momentum / EMA over accepted skills)
|
|
348
|
+
* `meta_skill` (a meta-document guiding *how* to edit the skill)
|
|
349
|
+
* Autonomous LR (online edit-budget tuning)
|
|
350
|
+
* Gradient accumulation across steps
|
|
351
|
+
* `rewrite` / `full_rewrite_minibatch` update modes
|
|
352
|
+
* Codex / Claude-Code / Qwen / MiniMax execution backends
|
|
353
|
+
* Ray-based distributed rollouts
|
|
354
|
+
* WebUI
|
|
355
|
+
|
|
356
|
+
PRs welcome.
|
|
357
|
+
|
|
358
|
+
---
|
|
359
|
+
|
|
360
|
+
## 11. License
|
|
361
|
+
|
|
362
|
+
MIT. See [LICENSE](LICENSE).
|