@synsci/cli-linux-x64-musl 1.1.50 → 1.1.52
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/bin/skills/tinker/SKILL.md +64 -6
- package/bin/skills/tinker/references/dpo-and-preference.md +174 -0
- package/bin/skills/tinker/references/evaluations.md +183 -0
- package/bin/skills/tinker/references/getting-started.md +1 -1
- package/bin/skills/tinker/references/models-and-lora.md +12 -3
- package/bin/skills/tinker/references/recipes.md +48 -2
- package/bin/skills/tinker/references/reinforcement-learning.md +153 -8
- package/bin/skills/tinker/references/rendering.md +13 -1
- package/bin/skills/tinker/references/supervised-learning.md +31 -7
- package/bin/synsc +0 -0
- package/package.json +1 -1
package/bin/skills/tinker/SKILL.md

@@ -4,7 +4,7 @@ description: Provides guidance for fine-tuning LLMs using the Tinker cloud train
 version: 1.0.0
 author: Synthetic Sciences
 license: MIT
-tags: [Fine-Tuning, Tinker, LoRA, Reinforcement Learning, Supervised Learning, Cloud Training, Vision-Language Models]
+tags: [Fine-Tuning, Tinker, LoRA, Reinforcement Learning, Supervised Learning, DPO, RLHF, Cloud Training, Vision-Language Models]
 dependencies: [tinker, tinker-cookbook, chz, transformers>=4.40.0, datasets, numpy]
 ---
 
@@ -44,10 +44,12 @@ Expert guidance for fine-tuning large language models using Tinker's managed clo
 | Setup & Core Concepts | [Getting Started](references/getting-started.md) |
 | API Classes & Types | [API Reference](references/api-reference.md) |
 | Supervised Learning | [Supervised Learning](references/supervised-learning.md) |
-| RL Training | [Reinforcement Learning](references/reinforcement-learning.md) |
+| RL Training & Environments | [Reinforcement Learning](references/reinforcement-learning.md) |
+| DPO, RLHF & Distillation | [DPO & Preference Learning](references/dpo-and-preference.md) |
 | Loss Functions | [Loss Functions](references/loss-functions.md) |
 | Chat Templates | [Rendering](references/rendering.md) |
 | Models & LoRA | [Models & LoRA](references/models-and-lora.md) |
+| Evaluations | [Evaluations](references/evaluations.md) |
 | Example Scripts | [Recipes](references/recipes.md) |
 
 ## Installation
@@ -81,6 +83,7 @@ from tinker_cookbook.supervised.types import ChatDatasetBuilderCommonConfig
 from tinker_cookbook.supervised.data import FromConversationFileBuilder
 from tinker_cookbook.renderers import TrainOnWhat
 from tinker_cookbook.model_info import get_recommended_renderer_name
+from tinker_cookbook.hyperparam_utils import get_lr
 
 model_name = "Qwen/Qwen3-30B-A3B"
 renderer_name = get_recommended_renderer_name(model_name)
@@ -102,8 +105,8 @@ blueprint = chz.Blueprint(train.Config).apply({
     "log_path": "/tmp/sft-run",
     "model_name": model_name,
     "dataset_builder": dataset_builder,
-    "learning_rate":
-    "lr_schedule": "
+    "learning_rate": get_lr(model_name),
+    "lr_schedule": "linear",
     "num_epochs": 3,
     "lora_rank": 32,
 })
@@ -274,14 +277,66 @@ training_client = service_client.create_lora_training_client(
 
 | Parameter | SFT Default | RL Default | Notes |
 |-----------|-------------|------------|-------|
-| `learning_rate` |
+| `learning_rate` | `get_lr(model)` | 4e-5 | Model-dependent; ~5e-4 for Qwen3-30B, ~2.8e-4 for Llama-8B |
 | `batch_size` | 128 | 128 | Smaller generally better for fine-tuning |
 | `lora_rank` | 32 | 32 | Higher rank = more capacity |
 | `group_size` | N/A | 16 | Rollouts per problem for RL |
 | `max_length` | 2048-32768 | N/A | Sequence length for SFT |
 | `max_tokens` | N/A | 256 | Max generation length for RL |
 | `num_epochs` | 1-3 | N/A | Training passes |
-| `lr_schedule` |
+| `lr_schedule` | linear | N/A | Only `linear` and `constant` supported |
+
+## Workflow 3: DPO (Preference Learning)
+
+Use this for aligning models with human preferences without a separate reward model.
+
+### Quick Start
+
+```bash
+python -m tinker_cookbook.recipes.preference.train \
+    log_path=/tmp/dpo-experiment \
+    model_name=meta-llama/Llama-3.2-1B \
+    dataset=hhh \
+    renderer_name=role_colon \
+    learning_rate=1e-5 \
+    dpo_beta=0.1
+```
+
+**Key differences from SFT**: Use lower LR (1e-5 to 1e-6), base model should be in-distribution with preference data.
+
+**Available datasets**: `hhh` (Anthropic), `helpsteer3` (NVIDIA), `ultrafeedback`
+
+**Full RLHF pipeline**: See [DPO & Preference Learning](references/dpo-and-preference.md) for the three-step SL → preference model → RL pipeline.
+
+---
+
+## Evaluations
+
+### Inline (During Training)
+
+Add `evaluator_builders` to config for periodic evaluation:
+
+```python
+blueprint = chz.Blueprint(train.Config).apply({
+    ...
+    "evaluator_builders": [my_evaluator],
+    "eval_every": 8,
+})
+```
+
+### Offline (After Training)
+
+```bash
+MODEL_PATH=tinker://YOUR_MODEL_PATH_HERE
+python -m tinker_cookbook.eval.run_inspect_evals \
+    model_path=$MODEL_PATH \
+    model_name=MODEL_NAME \
+    tasks=inspect_evals/ifeval,inspect_evals/mmlu_0_shot
+```
+
+See [Evaluations](references/evaluations.md) for custom evaluators and LLM-as-judge.
+
+---
 
 ## Cost Estimation & Usage Tracking
 
@@ -322,6 +377,9 @@ The CLI tracks all Tinker usage and reports it to the Synthetic Sciences dashboa
 | Reward stuck at 0 | Debug reward function independently, check answer extraction |
 | All advantages = 0 | Increase group size, ensure reward variance across completions |
 | Wrong tokenizer | Use model-specific tokenizer (see Models & LoRA reference) |
+| `Unknown learning rate schedule` | Only `"linear"` and `"constant"` are supported; `"cosine"` does NOT work |
+| Python 3.14 pydantic errors | Tinker requires Python 3.10-3.13; pydantic v1 is incompatible with 3.14+ |
+| Only 1 step per epoch | `batch_size` too large for dataset size; aim for 100+ steps per epoch |
 
 ## Saving and Resuming
 
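A quick sanity check for the "Only 1 step per epoch" troubleshooting row above. This is illustrative arithmetic only, not a Tinker API, and the dataset size is made up:

```python
# Steps per epoch is roughly dataset_size // batch_size, so a small dataset
# with the default batch_size=128 trains in very few optimizer steps.
dataset_size = 2_000
batch_size = 128
print(dataset_size // batch_size)   # 15 steps per epoch, well under the 100+ target

# Shrinking the batch (or adding data) raises the step count:
smaller_batch = 16
print(dataset_size // smaller_batch)  # 125 steps per epoch
```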
package/bin/skills/tinker/references/dpo-and-preference.md

@@ -0,0 +1,174 @@
+# DPO & Preference Learning
+
+## Direct Preference Optimization (DPO)
+
+DPO trains models to prefer chosen responses over rejected ones using a classification loss, without needing a separate reward model.
+
+### Quick Start
+
+```bash
+python -m tinker_cookbook.recipes.preference.train \
+    log_path=/tmp/dpo-experiment \
+    model_name=meta-llama/Llama-3.2-1B \
+    dataset=hhh \
+    renderer_name=role_colon \
+    learning_rate=1e-5 \
+    dpo_beta=0.1
+```
+
+### Key Parameters
+
+| Parameter | Description | Recommended |
+|-----------|-------------|-------------|
+| `model_name` | Base model (also used as reference policy) | Start with 1B-8B |
+| `dataset` | Preference dataset name | `hhh`, `helpsteer3`, `ultrafeedback` |
+| `renderer_name` | Chat format renderer | Match model family |
+| `learning_rate` | LR for optimization | 1e-5 to 1e-6 (lower than SFT) |
+| `dpo_beta` | Preference strength | Start with 0.1 |
+| `log_path` | Output directory | Required |
+
+### Available Datasets
+
+| Dataset | Source | Description |
+|---------|--------|-------------|
+| `hhh` | Anthropic | Helpful-Harmless-Honest pairwise comparisons |
+| `helpsteer3` | NVIDIA | HelpSteer3 preference dataset |
+| `ultrafeedback` | UltraFeedback | Binarized preferences |
+
+Custom datasets: implement `DPODatasetBuilder` from `tinker_cookbook.preference.preference_datasets`.
+
+### Training Metrics
+
+| Metric | Description | Watch For |
+|--------|-------------|-----------|
+| `dpo_loss` | DPO classification loss | Should decrease |
+| `accuracy` | Implicit reward model accuracy | Should increase |
+| `margin` | Chosen - rejected reward gap | Should increase |
+| `chosen_reward` | Average reward for chosen responses | Higher is better |
+| `rejected_reward` | Average reward for rejected responses | Lower is better |
+
+### Tips
+
+- **Beta parameter**: Start with `dpo_beta=0.1`, adjust based on dataset
+- **Learning rate**: Use lower LR than SFT (1e-5 to 1e-6)
+- **Base model**: Should already be in-distribution with the preference data. Either start with a light SFT phase or collect on-policy preferences. Sharp distribution mismatch creates strange behaviors
+- **Evaluation**: Use Inspect AI to evaluate after training (see [Evaluations](evaluations.md))
+
+### Evaluating DPO Models
+
+```bash
+MODEL_PATH=tinker://YOUR_MODEL_PATH_HERE
+python -m tinker_cookbook.eval.run_inspect_evals \
+    model_path=$MODEL_PATH \
+    model_name=meta-llama/Llama-3.2-1B \
+    tasks=inspect_evals/ifeval,inspect_evals/mmlu_0_shot
+```
+
+## RLHF Pipeline
+
+Full pipeline: SL → Preference Model → RL, implemented in `recipes/preference/rlhf/rlhf_pipeline.py`.
+
+```bash
+python -m recipes.preference.rlhf.rlhf_pipeline
+```
+
+### Step 1: Train Initial Policy (SL)
+
+Train on instruction-following data (e.g., no_robots from HuggingFace) to match InstructGPT methodology.
+
+### Step 2: Train Preference Model (SL)
+
+Train on pairwise comparison data (e.g., HHH dataset from Anthropic). Model sees completions A and B, outputs which is preferred.
+
+### Step 3: Train Policy via RL
+
+Use preference model as reward signal. For each prompt:
+1. Sample multiple completions from the policy
+2. Use preference model to grade all pairs
+3. Give reward based on win fraction (self-play)
+
+## Prompt Distillation
+
+Train a model to behave as if given a long prompt, without needing that prompt at inference time.
+
+### How It Works
+
+1. **Teacher generates data**: Long, detailed teacher prompt + queries → responses
+2. **Student trains on data**: Student model fine-tunes on (query, response) pairs without the teacher prompt
+
+### Example: Language Classification
+
+```bash
+# Step 1: Generate training data with teacher model
+python -m tinker_cookbook.recipes.prompt_distillation.create_data \
+    output_file=/tmp/tinker-datasets/prompt_distillation_lang.jsonl
+
+# Step 2: Train student model on distilled data
+python -m tinker_cookbook.recipes.prompt_distillation.train
+
+# Step 3: Test distilled model
+# Sample from trained model to verify performance
+```
+
+### When to Use
+
+- System prompt grows impractically long and model starts ignoring instructions
+- Need fast inference without long context overhead
+- Want to specialize a model for a narrow task distribution
+- Teacher and student can be the same model (self-distillation)
+
+### Advanced Configuration
+
+- **Teacher model selection**: Choose based on quality requirements
+- **Sampling strategies**: Adjust temperature and generation parameters
+- **Data volume**: Scale generated examples based on task complexity
+- **Training hyperparameters**: Follow standard SL hyperparameter guidance
+
+## LR Sweep Methodology
+
+For finding task-specific optimal LR:
+
+### Setup
+
+```python
+from tinker_cookbook.hyperparam_utils import get_lr
+default_lr = get_lr("meta-llama/Llama-3.1-8B")  # ~2.8e-4
+```
+
+### Sweep Range
+
+Sweep one order of magnitude above and below default:
+
+```bash
+# Launch in parallel (separate terminals)
+python -m tinker_cookbook.recipes.sl_loop learning_rate=0.003 log_path=/tmp/sweep/lr-0.003
+python -m tinker_cookbook.recipes.sl_loop learning_rate=0.001 log_path=/tmp/sweep/lr-0.001
+python -m tinker_cookbook.recipes.sl_loop learning_rate=0.0003 log_path=/tmp/sweep/lr-0.0003
+python -m tinker_cookbook.recipes.sl_loop learning_rate=0.0001 log_path=/tmp/sweep/lr-0.0001
+python -m tinker_cookbook.recipes.sl_loop learning_rate=0.00003 log_path=/tmp/sweep/lr-0.00003
+python -m tinker_cookbook.recipes.sl_loop learning_rate=0.00001 log_path=/tmp/sweep/lr-0.00001
+```
+
+### Collect and Visualize
+
+```python
+from glob import glob
+import pandas, json, os
+
+data = []
+for fname in sorted(glob("/tmp/sweep/*/metrics.jsonl")):
+    df = pandas.read_json(fname, lines=True)
+    if len(df) == 0 or df["progress"].iloc[-1] < 0.98:
+        continue
+    config = json.load(open(fname.replace("metrics.jsonl", "config.json")))
+    data.append({
+        "learning_rate": config["learning_rate"],
+        "final_loss": df["train_mean_nll"].iloc[-1].item()
+    })
+
+df = pandas.DataFrame(data)
+optimal_lr = df["learning_rate"][df["final_loss"].idxmin()]
+print(f"Optimal LR: {optimal_lr:.2e}")
+```
+
+Expected result: U-shaped curve with optimal LR near the `get_lr()` default.
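To make the DPO metrics table above concrete, here is a schematic NumPy sketch of the standard DPO quantities (implicit rewards, margin, loss, accuracy). It is not the tinker-cookbook implementation; `beta` and the per-example log-probability arrays are hypothetical stand-ins:

```python
import numpy as np

# Hypothetical summed log-probs of each response under the policy and the frozen reference.
beta = 0.1
logp_chosen,  logp_rejected = np.array([-12.0, -9.5]), np.array([-14.0, -9.0])
ref_chosen,   ref_rejected  = np.array([-13.0, -9.8]), np.array([-13.5, -9.1])

chosen_reward   = beta * (logp_chosen - ref_chosen)       # implicit reward of chosen responses
rejected_reward = beta * (logp_rejected - ref_rejected)   # implicit reward of rejected responses
margin   = chosen_reward - rejected_reward                 # should grow during training
dpo_loss = np.log(1 + np.exp(-margin))                     # -log(sigmoid(margin)), should shrink
accuracy = (margin > 0).mean()                             # implicit reward model accuracy
print(dpo_loss.mean(), accuracy)
```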
package/bin/skills/tinker/references/evaluations.md

@@ -0,0 +1,183 @@
+# Evaluations
+
+## Inline Evals (During Training)
+
+Add evaluations that run periodically during training.
+
+### Supervised Fine-Tuning
+
+```python
+blueprint = chz.Blueprint(train.Config).apply({
+    "model_name": model_name,
+    "dataset_builder": dataset_builder,
+    "evaluator_builders": [my_evaluator],  # Runs every eval_every steps
+    "infrequent_evaluator_builders": [heavy_eval],  # Runs every infrequent_eval_every steps
+    "eval_every": 8,
+    "infrequent_eval_every": 50,
+})
+```
+
+### RL Training
+
+```python
+blueprint = chz.Blueprint(train.Config).apply({
+    "model_name": model_name,
+    "dataset_builder": builder,
+    "evaluator_builders": [sampling_eval],  # SamplingClientEvaluator instances
+    "eval_every": 5,
+})
+```
+
+## Offline Evals with Inspect AI
+
+Run standard evaluations on checkpoints using the Inspect AI library:
+
+```bash
+MODEL_PATH=tinker://YOUR_MODEL_PATH_HERE
+python -m tinker_cookbook.eval.run_inspect_evals \
+    model_path=$MODEL_PATH \
+    model_name=MODEL_NAME \
+    tasks=inspect_evals/ifeval,inspect_evals/mmlu_0_shot \
+    renderer_name=RENDERER_NAME
+```
+
+### Creating Custom Inspect AI Tasks
+
+```python
+import tinker
+from inspect_ai import Task, task
+from inspect_ai.dataset import MemoryDataset, Sample
+from inspect_ai.model import GenerateConfig as InspectAIGenerateConfig
+from inspect_ai.model import Model as InspectAIModel
+from inspect_ai.scorer import model_graded_qa
+from inspect_ai.solver import generate
+from tinker_cookbook.eval.inspect_utils import InspectAPIFromTinkerSampling
+
+QA_DATASET = MemoryDataset(
+    name="qa_dataset",
+    samples=[
+        Sample(input="What is the capital of France?", target="Paris"),
+        Sample(input="What is the capital of Italy?", target="Rome"),
+    ],
+)
+
+service_client = tinker.ServiceClient()
+sampling_client = service_client.create_sampling_client(
+    base_model="meta-llama/Llama-3.1-8B-Instruct"
+)
+
+api = InspectAPIFromTinkerSampling(
+    renderer_name="llama3",
+    model_name="meta-llama/Llama-3.1-8B-Instruct",
+    sampling_client=sampling_client,
+    verbose=False,
+)
+
+GRADER_MODEL = InspectAIModel(api=api, config=InspectAIGenerateConfig())
+
+@task
+def example_lm_as_judge() -> Task:
+    return Task(
+        name="llm_as_judge",
+        dataset=QA_DATASET,
+        solver=generate(),
+        scorer=model_graded_qa(
+            instructions="Grade strictly. Respond 'GRADE: C' if correct or 'GRADE: I' otherwise.",
+            partial_credit=False,
+            model=GRADER_MODEL,
+        ),
+    )
+```
+
+Inspect also supports any OpenAI-compatible API (e.g., openrouter) as the grader model.
+
+## Custom SamplingClientEvaluator
+
+Lower-level abstraction with fine-grained control:
+
+```python
+from typing import Any, Callable
+import tinker
+from tinker import types
+from tinker_cookbook import renderers
+from tinker_cookbook.evaluators import SamplingClientEvaluator
+from tinker_cookbook.tokenizer_utils import get_tokenizer
+
+class CustomEvaluator(SamplingClientEvaluator):
+    def __init__(
+        self,
+        dataset: Any,
+        grader_fn: Callable[[str, str], bool],
+        model_name: str,
+        renderer_name: str,
+    ):
+        self.dataset = dataset
+        self.grader_fn = grader_fn
+        tokenizer = get_tokenizer(model_name)
+        self.renderer = renderers.get_renderer(name=renderer_name, tokenizer=tokenizer)
+
+    async def __call__(self, sampling_client: tinker.SamplingClient) -> dict[str, float]:
+        sampling_params = types.SamplingParams(
+            max_tokens=100,
+            temperature=0.7,
+            top_p=1.0,
+            stop=self.renderer.get_stop_sequences(),
+        )
+
+        num_correct = 0
+        for datum in self.dataset:
+            model_input = self.renderer.build_generation_prompt(
+                [renderers.Message(role="user", content=datum["input"])]
+            )
+            r = await sampling_client.sample_async(
+                prompt=model_input, num_samples=1, sampling_params=sampling_params
+            )
+            tokens = r.sequences[0].tokens
+            response = self.renderer.parse_response(tokens)[0]
+            if self.grader_fn(response["content"], datum["output"]):
+                num_correct += 1
+
+        return {"accuracy": num_correct / len(self.dataset)}
+```
+
+### Usage
+
+```python
+import asyncio
+
+QA_DATASET = [
+    {"input": "What is the capital of France?", "output": "Paris"},
+    {"input": "What is the capital of Germany?", "output": "Berlin"},
+]
+
+def grader_fn(response: str, target: str) -> bool:
+    return target.lower() in response.lower()
+
+evaluator = CustomEvaluator(
+    dataset=QA_DATASET,
+    grader_fn=grader_fn,
+    renderer_name="llama3",
+    model_name="meta-llama/Llama-3.1-8B-Instruct",
+)
+
+service_client = tinker.ServiceClient()
+sampling_client = service_client.create_sampling_client(
+    base_model="meta-llama/Llama-3.1-8B-Instruct"
+)
+
+async def main():
+    result = await evaluator(sampling_client)
+    print(result)  # {"accuracy": 1.0}
+
+asyncio.run(main())
+```
+
+## Evaluation Strategy
+
+| Stage | Method | When |
+|-------|--------|------|
+| During SFT | `evaluator_builders` | Every N training steps |
+| During RL | `evaluator_builders` (SamplingClientEvaluator) | Every N iterations |
+| After training | `run_inspect_evals` CLI | On final checkpoint |
+| Custom tasks | Custom `SamplingClientEvaluator` | Any time with sampling client |
+| LLM-as-judge | Inspect AI with `model_graded_qa` | When automated grading needed |
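If the substring check in the usage example above is too lenient for a task, a stricter `grader_fn` can be swapped in. The sketch below is an assumption-level example (the normalization scheme is not from the cookbook); it keeps the `(response, target) -> bool` signature that `CustomEvaluator` expects:

```python
import string

def normalized_exact_match(response: str, target: str) -> bool:
    """Stricter alternative to substring matching: compare response and target
    after lowercasing and stripping punctuation and surrounding whitespace."""
    def norm(s: str) -> str:
        return s.lower().translate(str.maketrans("", "", string.punctuation)).strip()
    return norm(response) == norm(target)

# evaluator = CustomEvaluator(dataset=QA_DATASET, grader_fn=normalized_exact_match, ...)
```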
package/bin/skills/tinker/references/getting-started.md

@@ -61,7 +61,7 @@ import numpy as np
 
 for _ in range(6):
     fwdbwd_future = training_client.forward_backward(processed_examples, "cross_entropy")
-    optim_future = training_client.optim_step(types.AdamParams(learning_rate=
+    optim_future = training_client.optim_step(types.AdamParams(learning_rate=2e-4))  # Use get_lr() for production
 
     fwdbwd_result = fwdbwd_future.result()
     optim_result = optim_future.result()
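Rather than hardcoding `2e-4`, the model-dependent default can be pulled from `get_lr`, as the other references in this package do. A minimal sketch that reuses `training_client` and `types` from the loop above; verify the helper against your installed `tinker_cookbook` version:

```python
from tinker_cookbook.hyperparam_utils import get_lr

# Model-dependent default (~2.8e-4 for Llama-3.1-8B per the hyperparameter table).
lr = get_lr("meta-llama/Llama-3.1-8B")
optim_future = training_client.optim_step(types.AdamParams(learning_rate=lr))
```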
package/bin/skills/tinker/references/models-and-lora.md

@@ -3,7 +3,7 @@
 ## Model Selection Guide
 
 - **Use MoE models** - More cost effective than dense
-- **Base models** -
+- **Base models** - For LoRA-based post-training research (no built-in chat format)
 - **Instruction models** - Fast inference, no chain-of-thought
 - **Hybrid/Reasoning models** - Long chain-of-thought for quality
 
@@ -36,7 +36,7 @@
 **Sizes:** Compact (1-4B), Small (8B), Medium (30-32B), Large (70B+)
 
 **Types:**
-- **Base**: Pretrained, for post-training research
+- **Base**: Pretrained, no chat formatting — use for LoRA post-training research
 - **Instruction**: Chat-tuned, fast inference
 - **Hybrid**: Thinking + non-thinking modes
 - **Reasoning**: Always uses chain-of-thought
@@ -51,6 +51,12 @@ LoRA (Low-Rank Adaptation) fine-tunes small parameter subset instead of all weig
 - SL on small-medium instruction datasets: **Same as full fine-tuning**
 - RL: **Equivalent to full fine-tuning even with small ranks**
 - Large datasets: May underperform (increase rank)
+- LoRA performs better when applied to **all weight matrices** (attention + MLP + MoE). Attention-only LoRA underperforms even with matched parameter counts
+
+### LoRA Limitations
+
+- **Large batch sizes**: LoRA is less tolerant of large batch sizes than full FT — pays a larger loss penalty as batch size increases beyond some point. This penalty is NOT mitigated by increasing rank; it's a property of the product-of-matrices parametrization
+- **Large SL datasets**: When dataset exceeds LoRA capacity, results in worse training efficiency (not a distinct floor)
 
 ### LoRA Learning Rate
 
@@ -61,7 +67,10 @@ from tinker_cookbook.hyperparam_utils import get_lora_lr_over_full_finetune_lr
 
 model_name = "meta-llama/Llama-3.1-8B"
 factor = get_lora_lr_over_full_finetune_lr(model_name)
-#
+# Factor varies by model size:
+#   Llama-3.2-1B  → 32
+#   Llama-3.1-8B  → ~50
+#   Llama-3.1-70B → 128
 ```
 
 ### Recommended Learning Rate
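A sketch of how the multiplier above might be used, under the assumption that the returned factor scales a known full fine-tune learning rate up to a LoRA learning rate; the full fine-tune value here is a made-up placeholder:

```python
from tinker_cookbook.hyperparam_utils import get_lora_lr_over_full_finetune_lr

model_name = "meta-llama/Llama-3.1-8B"
full_finetune_lr = 5e-6                                    # hypothetical, for illustration only
factor = get_lora_lr_over_full_finetune_lr(model_name)     # ~50 for this model per the comment above
lora_lr = full_finetune_lr * factor                        # ~2.5e-4
```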
package/bin/skills/tinker/references/recipes.md

@@ -12,6 +12,7 @@ from tinker_cookbook.renderers import TrainOnWhat
 from tinker_cookbook.supervised import train
 from tinker_cookbook.supervised.data import FromConversationFileBuilder
 from tinker_cookbook.supervised.types import ChatDatasetBuilderCommonConfig
+from tinker_cookbook.hyperparam_utils import get_lr
 
 def build_config_blueprint() -> chz.Blueprint[train.Config]:
     model_name = "meta-llama/Llama-3.1-8B"
@@ -35,7 +36,7 @@ def build_config_blueprint() -> chz.Blueprint[train.Config]:
         "log_path": "/tmp/tinker-examples/sl_basic",
         "model_name": model_name,
         "dataset_builder": dataset,
-        "learning_rate":
+        "learning_rate": get_lr(model_name),
         "lr_schedule": "linear",
         "num_epochs": 1,
         "eval_every": 8,
@@ -61,13 +62,14 @@ from tinker_cookbook import checkpoint_utils, model_info, renderers
 from tinker_cookbook.supervised.common import compute_mean_nll
 from tinker_cookbook.supervised.data import conversation_to_datum
 from tinker_cookbook.tokenizer_utils import get_tokenizer
+from tinker_cookbook.hyperparam_utils import get_lr
 
 @chz.chz
 class Config:
     log_path: str = "/tmp/tinker-examples/sl-loop"
     model_name: str = "meta-llama/Llama-3.1-8B"
     batch_size: int = 128
-    learning_rate: float =
+    learning_rate: float = get_lr("meta-llama/Llama-3.1-8B")  # ~2.8e-4
     max_length: int = 32768
     train_on_what: renderers.TrainOnWhat = renderers.TrainOnWhat.ALL_ASSISTANT_MESSAGES
     lora_rank: int = 32
@@ -257,6 +259,41 @@ if __name__ == "__main__":
     chz.nested_entrypoint(main)
 ```
 
+## DPO / Preference Learning
+
+```bash
+python -m tinker_cookbook.recipes.preference.train \
+    log_path=/tmp/dpo-experiment \
+    model_name=meta-llama/Llama-3.2-1B \
+    dataset=hhh \
+    renderer_name=role_colon \
+    learning_rate=1e-5 \
+    dpo_beta=0.1
+```
+
+Available datasets: `hhh` (Anthropic), `helpsteer3` (NVIDIA), `ultrafeedback`
+
+See [DPO & Preference Learning](dpo-and-preference.md) for full RLHF pipeline and parameters.
+
+## Prompt Distillation
+
+```bash
+# Step 1: Generate training data with teacher model
+python -m tinker_cookbook.recipes.prompt_distillation.create_data \
+    output_file=/tmp/tinker-datasets/prompt_distillation_lang.jsonl
+
+# Step 2: Train student model on distilled data
+python -m tinker_cookbook.recipes.prompt_distillation.train
+```
+
+## Multi-Step RL (Twenty Questions)
+
+```bash
+python -m tinker_cookbook.recipes.twenty_questions.train
+```
+
+Complete multi-step environment where a question-asking agent learns to guess hidden words. Good reference for building custom multi-turn RL environments.
+
 ## Running Recipes
 
 ```bash
@@ -271,6 +308,15 @@ python -m tinker_cookbook.recipes.rl_basic
 
 # Manual RL loop
 python -m tinker_cookbook.recipes.rl_loop
+
+# DPO
+python -m tinker_cookbook.recipes.preference.train
+
+# Prompt distillation
+python -m tinker_cookbook.recipes.prompt_distillation.train
+
+# Multi-step RL
+python -m tinker_cookbook.recipes.twenty_questions.train
 ```
 
 ## CLI Overrides
package/bin/skills/tinker/references/reinforcement-learning.md

@@ -11,6 +11,8 @@ Fine-tunes Llama-3.1-8B on GSM8K with reward:
 1[answer correct] + 0.1 * (1[format correct] - 1)
 ```
 
+Training takes ~1 min/iteration, reaches ~63% accuracy after 15 iterations.
+
 ## Basic RL Config
 
 ```python
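The reward formula quoted above translates directly into a tiny scoring function. A minimal sketch; how `answer_correct` and `format_correct` are determined (answer extraction, format check) is left to your environment:

```python
def gsm8k_style_reward(answer_correct: bool, format_correct: bool) -> float:
    """Literal transcription of: 1[answer correct] + 0.1 * (1[format correct] - 1)."""
    return float(answer_correct) + 0.1 * (float(format_correct) - 1.0)

# Correct + well formatted: 1.0; correct but badly formatted: 0.9;
# wrong but well formatted: 0.0; wrong and badly formatted: -0.1.
assert gsm8k_style_reward(True, True) == 1.0
assert abs(gsm8k_style_reward(False, False) - (-0.1)) < 1e-9
```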
@@ -54,6 +56,93 @@ if __name__ == "__main__":
 - `entropy` - Per-token entropy
 - `kl_sample_train_v1/v2` - KL divergence (sampler vs learner)
 
+## RL Environment Classes
+
+Custom RL environments use three classes from `tinker_cookbook.rl.types`:
+
+### Env
+
+Stateful environment for a single agent episode. Discard after one episode.
+
+```python
+from tinker_cookbook.rl.types import Env, Observation, StopCondition, Action, StepResult
+
+class MyEnv(Env):
+    async def initial_observation(self) -> tuple[Observation, StopCondition]:
+        # Return initial tokens and stop condition
+        raise NotImplementedError
+
+    async def step(self, action: Action) -> StepResult:
+        # Process agent action, return next observation + reward
+        raise NotImplementedError
+```
+
+Note: `Env` operates on tokens (not strings/messages) because the training code needs exact tokens and logprobs.
+
+### EnvGroupBuilder
+
+Creates groups of environments (enables multi-agent training or paired comparisons):
+
+```python
+from tinker_cookbook.rl.types import EnvGroupBuilder
+
+class MyEnvGroupBuilder(EnvGroupBuilder):
+    async def make_envs(self) -> list[Env]:
+        return [MyEnv() for _ in range(group_size)]
+```
+
+### RLDataset
+
+Dataset of `EnvGroupBuilder` objects:
+
+```python
+from tinker_cookbook.rl.types import RLDataset
+
+class MyDataset(RLDataset):
+    def get_batch(self, index: int) -> list[EnvGroupBuilder]:
+        return [MyEnvGroupBuilder(problem) for problem in self.problems[index]]
+```
+
+### Multi-Step Environment Example
+
+See `tinker_cookbook.recipes.twenty_questions` for a complete multi-step environment where a question-asking agent learns to guess hidden words:
+
+```bash
+python -m tinker_cookbook.recipes.twenty_questions.train
+```
+
+## Completers (Policy Abstractions)
+
+Completers represent models/policies that can be sampled from.
+
+### TokenCompleter (for RL training)
+
+```python
+from tinker_cookbook.rl.types import TokenCompleter, TokensWithLogprobs
+
+class TokenCompleter:
+    async def __call__(
+        self, model_input: types.ModelInput, stop: StopCondition
+    ) -> TokensWithLogprobs:
+        ...
+```
+
+Used by RL algorithms because they work directly with tokens.
+
+### MessageCompleter (for inference/judging)
+
+```python
+from tinker_cookbook.rl.types import MessageCompleter
+
+class MessageCompleter:
+    async def __call__(self, messages: list[renderers.Message]) -> renderers.Message:
+        ...
+```
+
+Operates at message level — useful for judge models, multi-agent environments, and evaluation. Requires a renderer to convert messages to tokens for the sampling client.
+
+Concrete implementations: `TinkerTokenCompleter` and `TinkerMessageCompleter` (wrappers around `tinker.SamplingClient`).
+
 ## Custom RL Loop
 
 ```python
@@ -143,12 +232,18 @@ def main(config: Config):
 
 ## Hyperparameters
 
+### Learning Rate
+
+Same guidance as SL — use `get_lr(model_name)` as starting point. The rl_basic recipe uses `4e-5` for Llama-3.1-8B.
+
 ### Batch and Group Sizes
 
-- `batch_size`: Number of unique problems
+- `batch_size`: Number of unique problems per iteration
 - `group_size`: Rollouts per problem (for variance reduction)
 
-Scale: `LR ∝ √batch_size`
+Scale LR proportionally: `LR ∝ √batch_size`
+
+If you have limited problems, increase `group_size` to generate more training data.
 
 ### Multiple Updates (num_substeps)
 
|
|
|
157
252
|
num_substeps = 1
|
|
158
253
|
|
|
159
254
|
# Multiple updates: split batch into mini-batches
|
|
160
|
-
num_substeps = 4 # Batch must be divisible
|
|
255
|
+
num_substeps = 4 # Batch must be divisible by num_substeps
|
|
161
256
|
```
|
|
162
257
|
|
|
163
|
-
Use with PPO objective. Start with 2-4.
|
|
258
|
+
Use with PPO objective. Start with 2-4. Higher values risk out-of-distribution updates.
|
|
164
259
|
|
|
165
260
|
### Streaming Minibatch Training
|
|
166
261
|
|
|
167
|
-
Overlaps sampling and training for throughput:
|
|
262
|
+
Overlaps sampling and training for throughput (on-policy, pipeline efficiency only):
|
|
168
263
|
|
|
169
264
|
```python
|
|
170
265
|
StreamMinibatchConfig(
|
|
@@ -175,22 +270,72 @@ StreamMinibatchConfig(
|
|
|
175
270
|
|
|
176
271
|
### Async Off-Policy Training
|
|
177
272
|
|
|
178
|
-
For long rollouts:
|
|
273
|
+
For long rollouts (CoT, tool use, agentic workflows):
|
|
179
274
|
|
|
180
275
|
```python
|
|
181
276
|
AsyncConfig(
|
|
182
|
-
max_steps_off_policy=3, # Max age of trajectories
|
|
277
|
+
max_steps_off_policy=3, # Max age of trajectories before discard
|
|
183
278
|
groups_per_batch=64,
|
|
184
279
|
)
|
|
185
280
|
```
|
|
186
281
|
|
|
282
|
+
Start with `max_steps_off_policy < 5`. Monitor KL divergence carefully.
|
|
283
|
+
|
|
284
|
+
## Sequence Extension Property
|
|
285
|
+
|
|
286
|
+
Critical for multi-turn RL efficiency. When consecutive timesteps satisfy the extension property (each observation is a prefix extension of the previous), compute scales O(T) instead of O(T²).
|
|
287
|
+
|
|
288
|
+
### When Extension Holds
|
|
289
|
+
|
|
290
|
+
With `Qwen3Renderer(strip_thinking_from_history=False)`, thinking blocks are preserved:
|
|
291
|
+
|
|
292
|
+
```
|
|
293
|
+
Timestep 1: [User: Q1] [A: <think>...</think> A1] [User:]
|
|
294
|
+
Timestep 2: [User: Q1] [A: <think>...</think> A1] [User: Q2] [A: <think>...</think> A2] [User:]
|
|
295
|
+
```
|
|
296
|
+
|
|
297
|
+
Timestep 2 contains timestep 1 as a prefix → single Datum, KV-cache reuse.
|
|
298
|
+
|
|
299
|
+
### When Extension Breaks
|
|
300
|
+
|
|
301
|
+
With default `strip_thinking_from_history=True`, `<think>` blocks are stripped from history:
|
|
302
|
+
|
|
303
|
+
```
|
|
304
|
+
Timestep 1: [User: Q1] [A: <think>...</think> A1] [User:]
|
|
305
|
+
Timestep 2: [User: Q1] [A: A1] [User: Q2] [A: <think>...</think> A2] [User:]
|
|
306
|
+
```
|
|
307
|
+
|
|
308
|
+
Prefix doesn't match → separate Datums per timestep, O(T²) compute.
|
|
309
|
+
|
|
310
|
+
### Check Programmatically
|
|
311
|
+
|
|
312
|
+
```python
|
|
313
|
+
renderer.has_extension_property # True or False
|
|
314
|
+
```
|
|
315
|
+
|
|
316
|
+
For `Qwen3Renderer`:
|
|
317
|
+
- `strip_thinking_from_history=False` → `has_extension_property=True`
|
|
318
|
+
- `strip_thinking_from_history=True` (default) → `has_extension_property=False`
|
|
319
|
+
|
|
320
|
+
### Periodic Compaction (Hybrid)
|
|
321
|
+
|
|
322
|
+
Keep thinking visible most of the time, periodically strip old thinking blocks:
|
|
323
|
+
|
|
324
|
+
- Turns 1-N: keep thinking visible (extension holds, single datum)
|
|
325
|
+
- Turn N+1: strip thinking from turns 1-N (extension breaks once)
|
|
326
|
+
- Turns N+1 to 2N: keep thinking visible again
|
|
327
|
+
- Repeat every N turns
|
|
328
|
+
|
|
329
|
+
This amortizes recomputation cost over N turns with bounded context growth.
|
|
330
|
+
|
|
187
331
|
## Monitoring
|
|
188
332
|
|
|
189
333
|
### KL Divergence
|
|
190
334
|
|
|
191
335
|
Monitor `kl_sample_train_v1/v2`:
|
|
192
336
|
- Should stay below 0.01 for stable training
|
|
193
|
-
-
|
|
337
|
+
- Even on-policy training shows non-zero KL (implementation detail)
|
|
338
|
+
- KL crossing threshold indicates numerical instability
|
|
194
339
|
|
|
195
340
|
### Reward Curves
|
|
196
341
|
|
|
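For intuition about the `kl_sample_train` metrics above, here is a back-of-the-envelope estimator in the spirit of a v1-style KL check: average the per-token gap between sampler and learner logprobs over a rollout. The arrays are hypothetical; the real metrics are computed by the training service, so treat this only as a mental model:

```python
import numpy as np

sampler_logprobs = np.array([-1.200, -0.450, -2.100, -0.330])  # logprobs at sampling time
learner_logprobs = np.array([-1.204, -0.452, -2.106, -0.334])  # logprobs under current weights
kl_estimate = float(np.mean(sampler_logprobs - learner_logprobs))  # ~0.004, under the 0.01 guideline
```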
package/bin/skills/tinker/references/rendering.md

@@ -49,10 +49,22 @@ model_input, weights = renderer.build_supervised_example(
     messages,
     train_on_what=TrainOnWhat.ALL_ASSISTANT_MESSAGES,
 )
-# model_input: ModelInput (
+# model_input: ModelInput (token sequence)
 # weights: per-token weights (0.0 = prompt, 1.0 = train)
 ```
 
+**Default behavior:** If `train_on_what` is omitted, defaults to `LAST_ASSISTANT_MESSAGE`. Always specify explicitly to avoid surprises.
+
+```python
+# WRONG — silently trains only on last assistant message
+model_input, weights = renderer.build_supervised_example(messages)
+
+# RIGHT — explicitly train on all assistant messages
+model_input, weights = renderer.build_supervised_example(
+    messages, train_on_what=TrainOnWhat.ALL_ASSISTANT_MESSAGES
+)
+```
+
 ### build_generation_prompt
 
 For inference:
package/bin/skills/tinker/references/supervised-learning.md

@@ -17,6 +17,7 @@ from tinker_cookbook.supervised.types import ChatDatasetBuilderCommonConfig
 from tinker_cookbook.supervised.data import FromConversationFileBuilder
 from tinker_cookbook.renderers import TrainOnWhat
 from tinker_cookbook.model_info import get_recommended_renderer_name
+from tinker_cookbook.hyperparam_utils import get_lr
 
 def build_config_blueprint() -> chz.Blueprint[train.Config]:
     model_name = "meta-llama/Llama-3.1-8B"
@@ -39,8 +40,8 @@ def build_config_blueprint() -> chz.Blueprint[train.Config]:
         "log_path": "/tmp/training",
         "model_name": model_name,
         "dataset_builder": dataset_builder,
-        "learning_rate":
-        "lr_schedule": "
+        "learning_rate": get_lr(model_name),
+        "lr_schedule": "linear",
         "num_epochs": 3,
         "lora_rank": 32,
 })
@@ -182,19 +183,23 @@ class CustomDataset(SupervisedDataset):
     def __iter__(self):
         for item in self.data:
             messages = self._preprocess(item)
-
+            # build_supervised_example returns (model_input, weights) tuple
+            model_input, weights = self.renderer.build_supervised_example(
                 messages=messages,
                 train_on_what=TrainOnWhat.ALL_ASSISTANT_MESSAGES,
             )
+            tokens = model_input.to_ints()
             yield Datum(
-                model_input=ModelInput([
+                model_input=ModelInput.from_ints(tokens=tokens[:-1]),
                 loss_fn_inputs={
-                    "target_tokens": TensorData.from_numpy(np.array(
-                    "weights": TensorData.from_numpy(np.array(
+                    "target_tokens": TensorData.from_numpy(np.array(tokens[1:], dtype=np.int64)),
+                    "weights": TensorData.from_numpy(np.array(weights[1:], dtype=np.float32)),
                 },
             )
 ```
 
+**Note:** `build_supervised_example` returns a tuple `(model_input, weights)`. The `model_input` is a `ModelInput` (token sequence), and `weights` is a list of per-token floats (0.0 for prompt tokens, 1.0 for completion tokens). When constructing `Datum` manually, shift by one position: input is `tokens[:-1]`, targets are `tokens[1:]`, weights are `weights[1:]`.
+
 ## Hyperparameters
 
 ### Learning Rate
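A toy illustration of the shift-by-one convention described in the note above, using made-up token ids and weights rather than real renderer output:

```python
tokens  = [101, 7, 8, 9, 102]        # rendered conversation, 5 tokens
weights = [0.0, 0.0, 1.0, 1.0, 1.0]  # 0.0 = prompt, 1.0 = completion

model_tokens   = tokens[:-1]   # [101, 7, 8, 9]       -> Datum model_input
target_tokens  = tokens[1:]    # [7, 8, 9, 102]       -> loss_fn_inputs["target_tokens"]
target_weights = weights[1:]   # [0.0, 1.0, 1.0, 1.0] -> loss_fn_inputs["weights"]
assert len(model_tokens) == len(target_tokens) == len(target_weights)
```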
@@ -215,7 +220,26 @@ Formula: `LR = lr_base * M_LoRA * (2000/H_m)^P_m`
 
 - Smaller batch sizes (128) generally better for fine-tuning
 - Scale LR with `LR ∝ √batch_size`
-- Aim for at least 100 steps of training
+- Aim for at least 100 steps of training (1000+ steps for best results)
+
+## Checkpoints
+
+Two types of saved weights:
+
+| Method | Path Contains | Use Case |
+|--------|--------------|----------|
+| `save_weights_for_sampler()` | `/sampler_weights/` | Sampling/inference only (lightweight) |
+| `save_state()` | `/weights/` | Full optimizer state for resuming training |
+
+```python
+# Save for sampling
+path = training_client.save_weights_for_sampler(name="final").result().path
+sampling_client = service_client.create_sampling_client(model_path=path)
+
+# Save full state for resuming
+state_path = training_client.save_state(name="checkpoint").result().path
+training_client.load_state(state_path)  # Resume later
+```
 
 ## Output Files
 
package/bin/synsc
CHANGED
Binary file