skillrl 1.0.0__tar.gz

Files changed (39)
  1. skillrl-1.0.0/LICENSE +21 -0
  2. skillrl-1.0.0/MANIFEST.in +10 -0
  3. skillrl-1.0.0/PKG-INFO +362 -0
  4. skillrl-1.0.0/README.md +306 -0
  5. skillrl-1.0.0/pyproject.toml +52 -0
  6. skillrl-1.0.0/setup.cfg +4 -0
  7. skillrl-1.0.0/skillrl/__init__.py +49 -0
  8. skillrl-1.0.0/skillrl/config.py +148 -0
  9. skillrl-1.0.0/skillrl/core/__init__.py +36 -0
  10. skillrl-1.0.0/skillrl/core/editor.py +110 -0
  11. skillrl-1.0.0/skillrl/core/gate.py +94 -0
  12. skillrl-1.0.0/skillrl/core/scheduler.py +88 -0
  13. skillrl-1.0.0/skillrl/core/utils.py +96 -0
  14. skillrl-1.0.0/skillrl/envs/__init__.py +19 -0
  15. skillrl-1.0.0/skillrl/envs/base.py +52 -0
  16. skillrl-1.0.0/skillrl/envs/qa.py +163 -0
  17. skillrl-1.0.0/skillrl/llm/__init__.py +19 -0
  18. skillrl-1.0.0/skillrl/llm/base.py +56 -0
  19. skillrl-1.0.0/skillrl/llm/openai_client.py +163 -0
  20. skillrl-1.0.0/skillrl/pipeline/__init__.py +22 -0
  21. skillrl-1.0.0/skillrl/pipeline/aggregate.py +220 -0
  22. skillrl-1.0.0/skillrl/pipeline/reflect.py +253 -0
  23. skillrl-1.0.0/skillrl/pipeline/rollout.py +93 -0
  24. skillrl-1.0.0/skillrl/pipeline/select.py +110 -0
  25. skillrl-1.0.0/skillrl/prompts/__init__.py +53 -0
  26. skillrl-1.0.0/skillrl/prompts/analyst_error.md +35 -0
  27. skillrl-1.0.0/skillrl/prompts/analyst_success.md +31 -0
  28. skillrl-1.0.0/skillrl/prompts/merge_failure.md +22 -0
  29. skillrl-1.0.0/skillrl/prompts/merge_final.md +23 -0
  30. skillrl-1.0.0/skillrl/prompts/merge_success.md +19 -0
  31. skillrl-1.0.0/skillrl/prompts/ranking.md +23 -0
  32. skillrl-1.0.0/skillrl/py.typed +0 -0
  33. skillrl-1.0.0/skillrl/trainer.py +714 -0
  34. skillrl-1.0.0/skillrl/types.py +241 -0
  35. skillrl-1.0.0/skillrl.egg-info/PKG-INFO +362 -0
  36. skillrl-1.0.0/skillrl.egg-info/SOURCES.txt +37 -0
  37. skillrl-1.0.0/skillrl.egg-info/dependency_links.txt +1 -0
  38. skillrl-1.0.0/skillrl.egg-info/requires.txt +10 -0
  39. skillrl-1.0.0/skillrl.egg-info/top_level.txt +1 -0
skillrl-1.0.0/LICENSE ADDED
@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2026 skillrl contributors
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
skillrl-1.0.0/MANIFEST.in ADDED
@@ -0,0 +1,10 @@
1
+ include README.md
2
+ include LICENSE
3
+ include pyproject.toml
4
+ recursive-include skillrl/prompts *.md
5
+ recursive-include skillrl/envs/data *.json
6
+ recursive-exclude tests *
7
+ recursive-exclude examples *
8
+ global-exclude __pycache__
9
+ global-exclude *.py[cod]
10
+ global-exclude .DS_Store
skillrl-1.0.0/PKG-INFO ADDED
@@ -0,0 +1,362 @@
1
+ Metadata-Version: 2.4
2
+ Name: skillrl
3
+ Version: 1.0.0
4
+ Summary: A TRL-like training library for end-to-end skill optimization of frozen LLM agents (based on Microsoft SkillOpt).
5
+ Author: skillrl contributors
6
+ License: MIT License
7
+
8
+ Copyright (c) 2026 skillrl contributors
9
+
10
+ Permission is hereby granted, free of charge, to any person obtaining a copy
11
+ of this software and associated documentation files (the "Software"), to deal
12
+ in the Software without restriction, including without limitation the rights
13
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
14
+ copies of the Software, and to permit persons to whom the Software is
15
+ furnished to do so, subject to the following conditions:
16
+
17
+ The above copyright notice and this permission notice shall be included in all
18
+ copies or substantial portions of the Software.
19
+
20
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
21
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
22
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
23
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
24
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
25
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
26
+ SOFTWARE.
27
+
28
+ Project-URL: Homepage, https://github.com/Xia12121/SkillRL
29
+ Project-URL: Repository, https://github.com/Xia12121/SkillRL
30
+ Project-URL: Issues, https://github.com/Xia12121/SkillRL/issues
31
+ Project-URL: Reference Repo, https://github.com/microsoft/SkillOpt
32
+ Keywords: llm,agent,skill,prompt-optimization,text-space-optimization,skillopt,trl
33
+ Classifier: Development Status :: 4 - Beta
34
+ Classifier: Programming Language :: Python :: 3
35
+ Classifier: Programming Language :: Python :: 3.10
36
+ Classifier: Programming Language :: Python :: 3.11
37
+ Classifier: Programming Language :: Python :: 3.12
38
+ Classifier: License :: OSI Approved :: MIT License
39
+ Classifier: Operating System :: OS Independent
40
+ Classifier: Intended Audience :: Science/Research
41
+ Classifier: Intended Audience :: Developers
42
+ Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
43
+ Requires-Python: >=3.10
44
+ Description-Content-Type: text/markdown
45
+ License-File: LICENSE
46
+ Requires-Dist: openai>=1.40.0
47
+ Requires-Dist: pyyaml>=6.0
48
+ Requires-Dist: tqdm>=4.65.0
49
+ Provides-Extra: dev
50
+ Requires-Dist: pytest>=7.0; extra == "dev"
51
+ Requires-Dist: pytest-mock>=3.10; extra == "dev"
52
+ Requires-Dist: ruff>=0.5; extra == "dev"
53
+ Requires-Dist: build>=1.2; extra == "dev"
54
+ Requires-Dist: twine>=5.0; extra == "dev"
55
+ Dynamic: license-file
56
+
57
+ # skillrl
58
+
59
+ > **A TRL-like training library for end-to-end skill optimization of frozen LLM agents.**
60
+ > Implements the core algorithm of Microsoft **SkillOpt** ([project page](https://microsoft.github.io/SkillOpt/), [repo](https://github.com/microsoft/SkillOpt)) as a clean, modular Python package — designed to grow into the *TRL of skill / prompt-space optimization*.
61
+
62
+ ---
63
+
64
+ ## 1. What is this?
65
+
66
+ Modern LLM agents are usually improved either by **fine-tuning weights** (expensive, opaque) or by **hand-tweaking prompts** (cheap, brittle, ad-hoc).
67
+
68
+ **SkillOpt** proposes a third path: treat a **natural-language *skill document*** (a markdown file of guidelines, heuristics, do/don'ts) as the *trainable state*. Both the **target LLM** (the agent that uses the skill) and the **optimizer LLM** (the model that critiques and rewrites the skill) stay **frozen**. Gradient descent is replaced by a textual analogue:
69
+
70
+ | SGD on weights | SkillOpt on skill text |
71
+ |---------------------------------|-------------------------------------------------|
72
+ | Forward pass | **Rollout**: target agent runs the skill on a batch |
73
+ | Backward pass (∂L/∂θ) | **Reflect**: optimizer LLM analyses success/failure → candidate edits |
74
+ | Gradient accumulation | **Aggregate**: hierarchical merge → one coherent patch |
75
+ | Gradient clipping / `learning_rate` | **Select**: rank edits, keep top-`L` (the *edit budget*) |
76
+ | `optimizer.step()` | **Update**: deterministically apply edits to the skill doc |
77
+ | Validation | **Evaluate**: hold-out gate — accept iff strictly better |
78
+
79
+ **`skillrl` packages this 6-stage pipeline as a TRL-style library**, so you can write:
80
+
81
+ ```python
82
+ trainer = SkillOptTrainer(config=cfg, env=env, optimizer_client=..., target_client=...)
83
+ summary = trainer.train()
84
+ ```
85
+
86
+ …just like you'd write `PPOTrainer(...).train()` in 🤗 TRL.
87
+
88
+ ---
89
+
90
+ ## 2. Why TRL-like?
91
+
92
+ | 🤗 TRL | skillrl |
93
+ |-----------------------------|------------------------------------------|
94
+ | `PPOConfig` (dataclass) | `SkillOptConfig` (dataclass) |
95
+ | `PPOTrainer.train()` | `SkillOptTrainer.train()` |
96
+ | Reward model | `SkillEnv` (`rollout_one` returns `hard`/`soft`/`fail_reason`) |
97
+ | Policy model (trainable) | **Skill document** (markdown, trainable) |
98
+ | Reference / value model | Frozen **target_client** |
99
+ | Optimizer (Adam) | Frozen **optimizer_client** + edit budget scheduler |
100
+ | Learning rate | `edit_budget` (max edits per step) + `lr_scheduler` (constant/linear/cosine) |
101
+ | Gradient clipping | LLM-based ranking — keeps top-`L` edits |
102
+ | Validation reward | `selection_split` gate (`hard` / `soft` / `mixed`) |
103
+
104
+ ---
105
+
106
+ ## 3. Installation
107
+
108
+ ```bash
109
+ # editable install from this repo
110
+ pip install -e .
111
+
112
+ # or with dev extras (pytest)
113
+ pip install -e .[dev]
114
+ ```
115
+
116
+ Requirements: Python ≥ 3.10, `openai>=1.40.0`. Any OpenAI-compatible endpoint (vLLM / Together / Azure / Moonshot / DeepSeek / …) works out of the box.
117
+
118
+ ---
119
+
120
+ ## 4. Quick start
121
+
122
+ A minimal end-to-end example using the bundled `SimpleQAEnv`:
123
+
124
+ ```python
125
+ from skillrl import SkillOptConfig, SkillOptTrainer
126
+ from skillrl.envs.qa import SimpleQAEnv
127
+ from skillrl.llm.openai_client import OpenAIChatClient
128
+
129
+ # 1. Data
130
+ train = [
131
+ {"id": "1", "question": "Capital of France?", "answers": ["Paris"]},
132
+ {"id": "2", "question": "Largest ocean on Earth?", "answers": ["Pacific Ocean", "Pacific"]},
133
+ # ... 30+ items recommended
134
+ ]
135
+ val = [{"id": "v1", "question": "Capital of Japan?", "answers": ["Tokyo"]}]
136
+ test = [{"id": "t1", "question": "Capital of Italy?", "answers": ["Rome"]}]
137
+
138
+ env = SimpleQAEnv(train_items=train, val_items=val, test_items=test)
139
+
140
+ # 2. Backends (optimizer_client = strong; target_client = the agent under training)
141
+ optimizer = OpenAIChatClient(model="gpt-4o") # critic / rewriter
142
+ target = OpenAIChatClient(model="gpt-4o-mini") # the frozen agent
143
+
144
+ # 3. Config (paper-default protocol)
145
+ cfg = SkillOptConfig(
146
+ num_epochs=2,
147
+ batch_size=8,
148
+ minibatch_size=4,
149
+ edit_budget=4,
150
+ lr_scheduler="cosine",
151
+ gate_metric="hard",
152
+ out_root="outputs/qa_demo",
153
+ )
154
+
155
+ # 4. Train
156
+ trainer = SkillOptTrainer(
157
+ config=cfg, env=env,
158
+ optimizer_client=optimizer, target_client=target,
159
+ initial_skill="You are a concise QA assistant. Answer in one short phrase.",
160
+ )
161
+ summary = trainer.train()
162
+ print(summary["best_selection_score"], summary["test_hard"])
163
+ ```
164
+
165
+ A runnable version lives at [`examples/train_qa.py`](examples/train_qa.py).
166
+
167
+ After training, the output directory contains everything you need to inspect the run:
168
+
169
+ ```
170
+ outputs/qa_demo/
171
+ ├── config.json # resolved config
172
+ ├── best_skill.md # all-time best skill (deploy this)
173
+ ├── current_skill.md # last accepted skill
174
+ ├── history.json # per-step records
175
+ ├── runtime_state.json # for auto-resume
176
+ ├── summary.json # final report
177
+ ├── skills/skill_v0001.md ... # per-step snapshots
178
+ ├── steps/step_0000/
179
+ │ ├── rollout_results.json
180
+ │ ├── raw_patches.json
181
+ │ ├── merged_patch.json
182
+ │ ├── ranked_patch.json
183
+ │ ├── candidate_skill.md
184
+ │ ├── edit_apply_report.json
185
+ │ ├── selection_eval/ # validation rollouts on this candidate
186
+ │ └── step_record.json
187
+ ├── test_eval_baseline/
188
+ └── test_eval_best/
189
+ ```
190
+
191
+ If you re-launch with the same `out_root`, training **auto-resumes** from the last completed step.
192
+
193
+ ---
194
+
195
+ ## 5. The 6-stage pipeline
196
+
197
+ ```
198
+ ┌──────────────────────────────────────────────────────────────┐
199
+ │ one optimization step │
200
+ │ │
201
+ │ current_skill.md │
202
+ │ │ │
203
+ │ ① ROLLOUT env.rollout_one(item, skill, target_client) │
204
+ │ │ (parallel, n=batch_size) │
205
+ │ ▼ │
206
+ │ trajectories → hard / soft / fail_reason │
207
+ │ │ │
208
+ │ ② REFLECT analyse failure & success minibatches │
209
+ │ │ optimizer_client → JSON {reasoning, edits} │
210
+ │ ▼ │
211
+ │ raw_patches (failure-tagged + success-tagged) │
212
+ │ │ │
213
+ │ ③ AGGREGATE hierarchical merge, failure-first │
214
+ │ │ optimizer_client → one coherent patch │
215
+ │ ▼ │
216
+ │ merged_patch │
217
+ │ │ │
218
+ │ ④ SELECT LLM ranks edits, keep top-L (edit_budget) │
219
+ │ │ ≈ "gradient clipping" in text space │
220
+ │ ▼ │
221
+ │ ranked_patch │
222
+ │ │ │
223
+ │ ⑤ UPDATE apply_patch(skill, ranked_patch) │
224
+ │ │ deterministic, append/insert/replace/delete │
225
+ │ ▼ │
226
+ │ candidate_skill.md │
227
+ │ │ │
228
+ │ ⑥ EVALUATE rollout on selection_split → gate │
229
+ │ │ accept iff strictly better than current_score │
230
+ │ ▼ │
231
+ │ if accept: current_skill := candidate │
232
+ │ if also > best_score: best_skill := candidate │
233
+ │ else: keep current_skill │
234
+ │ │
235
+ └──────────────────────────────────────────────────────────────┘
236
+ ```
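The Update stage in the diagram is described as deterministic, with `append` / `insert` / `replace` / `delete` operations. The sketch below illustrates one way such edits could be applied to a skill document line by line. The edit schema (`op`, `target`, `content` keys), the line-level matching, and the names `toy_apply_edit` / `toy_apply_patch` are all assumptions made for illustration; the real `skillrl/core/editor.py` is not shown in this part of the diff and may work differently.

```python
def toy_apply_edit(skill: str, edit: dict) -> str:
    """Apply one edit to a skill document (hypothetical schema, line-based)."""
    lines = skill.splitlines()
    op = edit["op"]
    if op == "append":
        # Add a new guideline at the end of the document.
        lines.append(edit["content"])
    elif op in ("insert", "replace", "delete"):
        try:
            idx = lines.index(edit["target"])  # first exact line match
        except ValueError:
            return skill  # assumption: an edit whose target is missing is skipped
        if op == "insert":
            lines.insert(idx + 1, edit["content"])  # insert after the target line
        elif op == "replace":
            lines[idx] = edit["content"]
        else:
            del lines[idx]
    else:
        raise ValueError(f"unknown op: {op!r}")
    return "\n".join(lines)


def toy_apply_patch(skill: str, edits: list[dict]) -> str:
    """Apply edits in order; the same inputs always give the same output."""
    for edit in edits:
        skill = toy_apply_edit(skill, edit)
    return skill


patch = [
    {"op": "append", "content": "C"},
    {"op": "insert", "target": "A", "content": "A2"},
    {"op": "replace", "target": "B", "content": "B!"},
    {"op": "delete", "target": "C"},
]
print(toy_apply_patch("A\nB", patch))
```

Because no LLM call is involved at this stage, a candidate skill can be rebuilt from the current skill and the ranked patch alone, which is presumably what makes per-step artifacts easy to reproduce.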
237
+
238
+ **Edit budget = textual learning rate.** The cap on edits applied per step is decayed (constant / linear / cosine) over the entire training horizon, exactly as the SkillOpt paper does.
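As a rough illustration of the decay described above, the sketch below computes a per-step edit cap for the three schedule names the README lists. The function name `edit_budget_at`, the floor of one edit per step, and the exact rounding are assumptions; the packaged `skillrl/core/scheduler.py` is not visible in this part of the diff and may differ.

```python
import math


def edit_budget_at(step: int, total_steps: int, base_budget: int,
                   schedule: str = "cosine", min_budget: int = 1) -> int:
    """Hypothetical helper: maximum number of edits allowed at a 0-based step."""
    if total_steps <= 1 or schedule == "constant":
        return base_budget
    # Fraction of the training horizon already consumed, clamped to [0, 1].
    progress = min(max(step / (total_steps - 1), 0.0), 1.0)
    if schedule == "linear":
        factor = 1.0 - progress
    elif schedule == "cosine":
        factor = 0.5 * (1.0 + math.cos(math.pi * progress))
    else:
        raise ValueError(f"unknown schedule: {schedule!r}")
    # Assumption: never drop below min_budget so late steps can still edit.
    return max(min_budget, round(base_budget * factor))


# Budget per step for a 5-step run starting at 4 edits.
print([edit_budget_at(s, 5, 4, "cosine") for s in range(5)])
```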
239
+
240
+ **Validation gate** is strict: candidates must *strictly* beat `current_score`. A separate `best_skill` is tracked in parallel, so the artifact you ship is always the all-time best.
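The accept rule can be sketched in a few lines. This follows the diagram above (accept only on a strict improvement over `current_score`, then update the best skill if the candidate also beats `best_score`); the `GateState` dataclass and `gate_step` function are hypothetical names for illustration, not the API of `skillrl/core/gate.py`.

```python
from dataclasses import dataclass


@dataclass
class GateState:
    """Hypothetical container for the two skills the trainer tracks."""
    current_skill: str
    current_score: float
    best_skill: str
    best_score: float


def gate_step(state: GateState, candidate_skill: str, candidate_score: float) -> bool:
    """Return True and update state if the candidate strictly beats current_score."""
    if candidate_score <= state.current_score:
        return False  # ties are rejected: the gate is strict
    state.current_skill = candidate_skill
    state.current_score = candidate_score
    if candidate_score > state.best_score:
        state.best_skill = candidate_skill
        state.best_score = candidate_score
    return True


state = GateState("v0", 0.5, "v0", 0.5)
print(gate_step(state, "v1", 0.5))   # tie with current_score
print(gate_step(state, "v2", 0.75))  # strict improvement
```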
241
+
242
+ ---
243
+
244
+ ## 6. Library structure
245
+
246
+ ```
247
+ skillrl/
248
+ ├── __init__.py # public exports
249
+ ├── config.py # SkillOptConfig (dataclass)
250
+ ├── types.py # Edit / Patch / RawPatch / RolloutResult / GateResult
251
+ ├── trainer.py # SkillOptTrainer — the main loop
252
+
253
+ ├── core/
254
+ │ ├── editor.py # apply_edit / apply_patch (5-Update)
255
+ │ ├── scheduler.py # constant / linear / cosine edit-budget schedulers
256
+ │ ├── gate.py # validation gate (hard / soft / mixed)
257
+ │ └── utils.py # extract_json, compute_score, skill_hash
258
+
259
+ ├── llm/
260
+ │ ├── base.py # BaseLLMClient interface
261
+ │ └── openai_client.py # OpenAI / Azure / OpenAI-compatible
262
+
263
+ ├── pipeline/
264
+ │ ├── rollout.py # 1-Rollout (parallel)
265
+ │ ├── reflect.py # 2-Reflect (failure / success minibatches)
266
+ │ ├── aggregate.py # 3-Aggregate (hierarchical merge, failure-first)
267
+ │ └── select.py # 4-Select (LLM rank + top-L clip)
268
+
269
+ ├── prompts/ # bundled markdown prompt templates
270
+ │ ├── analyst_error.md
271
+ │ ├── analyst_success.md
272
+ │ ├── merge_failure.md
273
+ │ ├── merge_success.md
274
+ │ ├── merge_final.md
275
+ │ └── ranking.md
276
+
277
+ └── envs/
278
+ ├── base.py # SkillEnv abstract class
279
+ └── qa.py # SimpleQAEnv (reference implementation)
280
+ ```
281
+
282
+ ---
283
+
284
+ ## 7. Writing your own environment
285
+
286
+ To train a skill on your task, subclass `SkillEnv`:
287
+
288
+ ```python
289
+ from skillrl.envs.base import SkillEnv
290
+ from skillrl.types import RolloutResult
291
+
292
+ class MyEnv(SkillEnv):
293
+ name = "my_env"
294
+
295
+ def get_initial_skill(self) -> str:
296
+ return "You are an expert XYZ agent..."
297
+
298
+ def get_items(self, split: str) -> list[dict]:
299
+ return self._splits[split] # train / val / test
300
+
301
+ def rollout_one(self, *, item, skill, target_client) -> RolloutResult:
302
+ # 1) build the conversation; the *skill* is typically the system prompt.
303
+ # 2) call target_client.chat(...) one or more times (multi-turn allowed).
304
+ # 3) score the outcome: hard ∈ {0,1}, soft ∈ [0,1].
305
+ # 4) return RolloutResult(...).
306
+ ...
307
+ ```
308
+
309
+ That's it — drop it into `SkillOptTrainer` and you're training.
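Step 3 of `rollout_one` (scoring) is the part most specific to a task. As a minimal sketch for a QA-style task, the helper below derives `hard`, `soft`, and `fail_reason` values from a prediction and a list of gold answers. The function `score_answer`, the normalisation rule, and the use of token-level F1 as the soft score are assumptions for illustration; they are not taken from `SimpleQAEnv`, and how the values are packed into `RolloutResult` is left out because its fields are not shown here.

```python
import re


def _normalize(text: str) -> list[str]:
    """Lowercase and keep alphanumeric tokens only."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score_answer(prediction: str, answers: list[str]) -> tuple[int, float, str]:
    """Hypothetical scorer: hard = exact match, soft = best token-set F1."""
    pred = _normalize(prediction)
    hard, soft = 0, 0.0
    for answer in answers:
        gold = _normalize(answer)
        if gold and pred == gold:
            hard = 1
        overlap = len(set(pred) & set(gold))
        if overlap:
            precision = overlap / len(set(pred))
            recall = overlap / len(set(gold))
            soft = max(soft, 2 * precision * recall / (precision + recall))
    # A specific fail_reason gives the Reflect stage something to work with.
    fail_reason = "" if hard else f"expected one of {answers!r}, got {prediction!r}"
    return hard, soft, fail_reason


print(score_answer("Paris.", ["Paris"]))
print(score_answer("the Pacific", ["Pacific Ocean", "Pacific"]))
```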
310
+
311
+ > **Tip.** For multi-turn / tool-using agents, return the full `conversation` list and a meaningful `fail_reason`. The Reflect stage uses both to localise *why* the skill failed and *what* to change.
312
+
313
+ ---
314
+
315
+ ## 8. Customising prompts
316
+
317
+ All optimizer-LLM prompts live in `skillrl/prompts/*.md`. Override any of them per-trainer without modifying the package:
318
+
319
+ ```python
320
+ trainer = SkillOptTrainer(
321
+ config=cfg, env=env,
322
+ optimizer_client=opt, target_client=tgt,
323
+ prompt_overrides={
324
+ "analyst_error": open("my_prompts/analyst_error.md").read(),
325
+ "ranking": open("my_prompts/ranking.md").read(),
326
+ },
327
+ )
328
+ ```
329
+
330
+ Available keys: `analyst_error`, `analyst_success`, `merge_failure`, `merge_success`, `merge_final`, `ranking`.
331
+
332
+ ---
333
+
334
+ ## 9. Reproducibility & observability
335
+
336
+ * **Determinism.** The same `seed`, `batch_size`, `minibatch_size`, dataset and backends produce the same minibatch shuffles and analyst groupings.
337
+ * **Auto-resume.** Re-running with the same `out_root` skips already-completed steps (rebuilds the selection cache from `history.json`).
338
+ * **Per-step artifacts.** Every stage's input/output is dumped — easy to diff between steps and reproduce any single step locally.
339
+ * **Selection cache.** Identical candidate skills (by `skill_hash`) reuse cached selection-split scores — saves a *lot* of money on long runs.
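The selection cache idea can be illustrated as follows: hash the candidate skill text and only run the selection-split evaluation when that hash has not been seen. This is a sketch under assumptions: `toy_skill_hash` (SHA-256 over stripped text) stands in for the package's `skill_hash`, whose actual definition is not shown here, and `SelectionCache` is a hypothetical class.

```python
import hashlib
from typing import Callable


def toy_skill_hash(skill: str) -> str:
    """Stand-in hash: SHA-256 of the whitespace-stripped skill text (assumption)."""
    return hashlib.sha256(skill.strip().encode("utf-8")).hexdigest()


class SelectionCache:
    """Memoise selection-split scores by skill hash (hypothetical class)."""

    def __init__(self, evaluate: Callable[[str], float]):
        self._evaluate = evaluate  # the expensive rollout-and-score call
        self._scores: dict[str, float] = {}
        self.misses = 0

    def score(self, skill: str) -> float:
        key = toy_skill_hash(skill)
        if key not in self._scores:
            self.misses += 1
            self._scores[key] = self._evaluate(skill)
        return self._scores[key]


# Toy evaluator: pretend longer skills score higher.
cache = SelectionCache(lambda skill: len(skill) / 100)
cache.score("Answer briefly.")
cache.score("Answer briefly.\n")  # same text after stripping, so a cache hit
print(cache.misses)
```

A cache like this pays off when a rejected candidate reappears in a later step, since each miss costs a full pass over the selection split.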
340
+
341
+ ---
342
+
343
+ ## 10. What's NOT in 1.0 (yet)
344
+
345
+ `skillrl 1.0` ships the **core algorithm** as faithfully as possible. The following SkillOpt features are intentionally deferred to future minor releases:
346
+
347
+ * `slow_update` (skill momentum / EMA over accepted skills)
348
+ * `meta_skill` (a meta-document guiding *how* to edit the skill)
349
+ * Autonomous LR (online edit-budget tuning)
350
+ * Gradient accumulation across steps
351
+ * `rewrite` / `full_rewrite_minibatch` update modes
352
+ * Codex / Claude-Code / Qwen / MiniMax execution backends
353
+ * Ray-based distributed rollouts
354
+ * WebUI
355
+
356
+ PRs welcome.
357
+
358
+ ---
359
+
360
+ ## 11. License
361
+
362
+ MIT. See [LICENSE](LICENSE).