@synsci/cli-linux-x64-musl 1.1.50 → 1.1.52

@@ -4,7 +4,7 @@ description: Provides guidance for fine-tuning LLMs using the Tinker cloud train
4
4
  version: 1.0.0
5
5
  author: Synthetic Sciences
6
6
  license: MIT
7
- tags: [Fine-Tuning, Tinker, LoRA, Reinforcement Learning, Supervised Learning, Cloud Training, Vision-Language Models]
7
+ tags: [Fine-Tuning, Tinker, LoRA, Reinforcement Learning, Supervised Learning, DPO, RLHF, Cloud Training, Vision-Language Models]
8
8
  dependencies: [tinker, tinker-cookbook, chz, transformers>=4.40.0, datasets, numpy]
9
9
  ---
10
10
 
@@ -44,10 +44,12 @@ Expert guidance for fine-tuning large language models using Tinker's managed clo
44
44
  | Setup & Core Concepts | [Getting Started](references/getting-started.md) |
45
45
  | API Classes & Types | [API Reference](references/api-reference.md) |
46
46
  | Supervised Learning | [Supervised Learning](references/supervised-learning.md) |
47
- | RL Training | [Reinforcement Learning](references/reinforcement-learning.md) |
47
+ | RL Training & Environments | [Reinforcement Learning](references/reinforcement-learning.md) |
48
+ | DPO, RLHF & Distillation | [DPO & Preference Learning](references/dpo-and-preference.md) |
48
49
  | Loss Functions | [Loss Functions](references/loss-functions.md) |
49
50
  | Chat Templates | [Rendering](references/rendering.md) |
50
51
  | Models & LoRA | [Models & LoRA](references/models-and-lora.md) |
52
+ | Evaluations | [Evaluations](references/evaluations.md) |
51
53
  | Example Scripts | [Recipes](references/recipes.md) |
52
54
 
53
55
  ## Installation
@@ -81,6 +83,7 @@ from tinker_cookbook.supervised.types import ChatDatasetBuilderCommonConfig
81
83
  from tinker_cookbook.supervised.data import FromConversationFileBuilder
82
84
  from tinker_cookbook.renderers import TrainOnWhat
83
85
  from tinker_cookbook.model_info import get_recommended_renderer_name
86
+ from tinker_cookbook.hyperparam_utils import get_lr
84
87
 
85
88
  model_name = "Qwen/Qwen3-30B-A3B"
86
89
  renderer_name = get_recommended_renderer_name(model_name)
@@ -102,8 +105,8 @@ blueprint = chz.Blueprint(train.Config).apply({
102
105
  "log_path": "/tmp/sft-run",
103
106
  "model_name": model_name,
104
107
  "dataset_builder": dataset_builder,
105
- "learning_rate": 2e-4,
106
- "lr_schedule": "cosine",
108
+ "learning_rate": get_lr(model_name),
109
+ "lr_schedule": "linear",
107
110
  "num_epochs": 3,
108
111
  "lora_rank": 32,
109
112
  })
@@ -274,14 +277,66 @@ training_client = service_client.create_lora_training_client(
274
277
 
275
278
  | Parameter | SFT Default | RL Default | Notes |
276
279
  |-----------|-------------|------------|-------|
277
- | `learning_rate` | 2e-4 | 4e-5 | Use `get_lr(model_name)` for recommended |
280
+ | `learning_rate` | `get_lr(model)` | 4e-5 | Model-dependent; ~5e-4 for Qwen3-30B, ~2.8e-4 for Llama-8B |
278
281
  | `batch_size` | 128 | 128 | Smaller generally better for fine-tuning |
279
282
  | `lora_rank` | 32 | 32 | Higher rank = more capacity |
280
283
  | `group_size` | N/A | 16 | Rollouts per problem for RL |
281
284
  | `max_length` | 2048-32768 | N/A | Sequence length for SFT |
282
285
  | `max_tokens` | N/A | 256 | Max generation length for RL |
283
286
  | `num_epochs` | 1-3 | N/A | Training passes |
284
- | `lr_schedule` | cosine | N/A | LR decay schedule |
287
+ | `lr_schedule` | linear | N/A | Only `linear` and `constant` supported |
288
+
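+ The model-dependent SFT default comes from `get_lr`; for example (values are the approximations quoted in the table above):
+
+ ```python
+ from tinker_cookbook.hyperparam_utils import get_lr
+
+ get_lr("Qwen/Qwen3-30B-A3B")       # ~5e-4
+ get_lr("meta-llama/Llama-3.1-8B")  # ~2.8e-4
+ ```
+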
289
+ ## Workflow 3: DPO (Preference Learning)
290
+
291
+ Use this for aligning models with human preferences without a separate reward model.
292
+
293
+ ### Quick Start
294
+
295
+ ```bash
296
+ python -m tinker_cookbook.recipes.preference.train \
297
+ log_path=/tmp/dpo-experiment \
298
+ model_name=meta-llama/Llama-3.2-1B \
299
+ dataset=hhh \
300
+ renderer_name=role_colon \
301
+ learning_rate=1e-5 \
302
+ dpo_beta=0.1
303
+ ```
304
+
305
+ **Key differences from SFT**: Use a lower LR (1e-5 to 1e-6), and the base model should already be in-distribution with the preference data.
306
+
307
+ **Available datasets**: `hhh` (Anthropic), `helpsteer3` (NVIDIA), `ultrafeedback`
308
+
309
+ **Full RLHF pipeline**: See [DPO & Preference Learning](references/dpo-and-preference.md) for the three-step SL → preference model → RL pipeline.
310
+
311
+ ---
312
+
313
+ ## Evaluations
314
+
315
+ ### Inline (During Training)
316
+
317
+ Add `evaluator_builders` to config for periodic evaluation:
318
+
319
+ ```python
320
+ blueprint = chz.Blueprint(train.Config).apply({
321
+ ...
322
+ "evaluator_builders": [my_evaluator],
323
+ "eval_every": 8,
324
+ })
325
+ ```
326
+
327
+ ### Offline (After Training)
328
+
329
+ ```bash
330
+ MODEL_PATH=tinker://YOUR_MODEL_PATH_HERE
331
+ python -m tinker_cookbook.eval.run_inspect_evals \
332
+ model_path=$MODEL_PATH \
333
+ model_name=MODEL_NAME \
334
+ tasks=inspect_evals/ifeval,inspect_evals/mmlu_0_shot
335
+ ```
336
+
337
+ See [Evaluations](references/evaluations.md) for custom evaluators and LLM-as-judge.
338
+
339
+ ---
285
340
 
286
341
  ## Cost Estimation & Usage Tracking
287
342
 
@@ -322,6 +377,9 @@ The CLI tracks all Tinker usage and reports it to the Synthetic Sciences dashboa
322
377
  | Reward stuck at 0 | Debug reward function independently, check answer extraction |
323
378
  | All advantages = 0 | Increase group size, ensure reward variance across completions |
324
379
  | Wrong tokenizer | Use model-specific tokenizer (see Models & LoRA reference) |
380
+ | `Unknown learning rate schedule` | Only `"linear"` and `"constant"` are supported; `"cosine"` does NOT work |
381
+ | Python 3.14 pydantic errors | Tinker requires Python 3.10-3.13; pydantic v1 is incompatible with 3.14+ |
382
+ | Only 1 step per epoch | `batch_size` too large for dataset size; aim for 100+ steps per epoch |
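+
+ For the last row, a quick back-of-the-envelope check (the dataset and batch sizes below are illustrative, not defaults):
+
+ ```python
+ # Steps per epoch ≈ dataset_size // batch_size; aim for roughly 100+ steps
+ dataset_size, batch_size = 3_000, 128
+ steps_per_epoch = dataset_size // batch_size  # 23 -> too few; shrink batch_size or add data
+ ```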
325
383
 
326
384
  ## Saving and Resuming
327
385
 
@@ -0,0 +1,174 @@
1
+ # DPO & Preference Learning
2
+
3
+ ## Direct Preference Optimization (DPO)
4
+
5
+ DPO trains models to prefer chosen responses over rejected ones using a classification loss, without needing a separate reward model.
6
+
7
+ ### Quick Start
8
+
9
+ ```bash
10
+ python -m tinker_cookbook.recipes.preference.train \
11
+ log_path=/tmp/dpo-experiment \
12
+ model_name=meta-llama/Llama-3.2-1B \
13
+ dataset=hhh \
14
+ renderer_name=role_colon \
15
+ learning_rate=1e-5 \
16
+ dpo_beta=0.1
17
+ ```
18
+
19
+ ### Key Parameters
20
+
21
+ | Parameter | Description | Recommended |
22
+ |-----------|-------------|-------------|
23
+ | `model_name` | Base model (also used as reference policy) | Start with 1B-8B |
24
+ | `dataset` | Preference dataset name | `hhh`, `helpsteer3`, `ultrafeedback` |
25
+ | `renderer_name` | Chat format renderer | Match model family |
26
+ | `learning_rate` | LR for optimization | 1e-5 to 1e-6 (lower than SFT) |
27
+ | `dpo_beta` | Preference strength | Start with 0.1 |
28
+ | `log_path` | Output directory | Required |
29
+
30
+ ### Available Datasets
31
+
32
+ | Dataset | Source | Description |
33
+ |---------|--------|-------------|
34
+ | `hhh` | Anthropic | Helpful-Harmless-Honest pairwise comparisons |
35
+ | `helpsteer3` | NVIDIA | HelpSteer3 preference dataset |
36
+ | `ultrafeedback` | UltraFeedback | Binarized preferences |
37
+
38
+ Custom datasets: implement `DPODatasetBuilder` from `tinker_cookbook.preference.preference_datasets`.
39
+
40
+ ### Training Metrics
41
+
42
+ | Metric | Description | Watch For |
43
+ |--------|-------------|-----------|
44
+ | `dpo_loss` | DPO classification loss | Should decrease |
45
+ | `accuracy` | Implicit reward model accuracy | Should increase |
46
+ | `margin` | Chosen - rejected reward gap | Should increase |
47
+ | `chosen_reward` | Average reward for chosen responses | Higher is better |
48
+ | `rejected_reward` | Average reward for rejected responses | Lower is better |
49
+
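+ These metrics follow from the standard DPO objective. A minimal sketch of the per-pair loss (the textbook formulation, not necessarily the cookbook's exact implementation):
+
+ ```python
+ import math
+
+ def dpo_pair_loss(
+     policy_chosen_lp: float, policy_rejected_lp: float,
+     ref_chosen_lp: float, ref_rejected_lp: float,
+     beta: float = 0.1,
+ ) -> tuple[float, float, float]:
+     # Implicit rewards are beta-scaled log-prob ratios against the frozen reference policy
+     chosen_reward = beta * (policy_chosen_lp - ref_chosen_lp)
+     rejected_reward = beta * (policy_rejected_lp - ref_rejected_lp)
+     margin = chosen_reward - rejected_reward
+     # Classification loss: -log sigmoid(margin); `accuracy` counts pairs with margin > 0
+     loss = -math.log(1.0 / (1.0 + math.exp(-margin)))
+     return loss, chosen_reward, rejected_reward
+ ```
+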
50
+ ### Tips
51
+
52
+ - **Beta parameter**: Start with `dpo_beta=0.1`, adjust based on dataset
53
+ - **Learning rate**: Use lower LR than SFT (1e-5 to 1e-6)
54
+ - **Base model**: Should already be in-distribution with the preference data. Either start with a light SFT phase or collect on-policy preferences; a sharp distribution mismatch produces degenerate behavior
55
+ - **Evaluation**: Use Inspect AI to evaluate after training (see [Evaluations](evaluations.md))
56
+
57
+ ### Evaluating DPO Models
58
+
59
+ ```bash
60
+ MODEL_PATH=tinker://YOUR_MODEL_PATH_HERE
61
+ python -m tinker_cookbook.eval.run_inspect_evals \
62
+ model_path=$MODEL_PATH \
63
+ model_name=meta-llama/Llama-3.2-1B \
64
+ tasks=inspect_evals/ifeval,inspect_evals/mmlu_0_shot
65
+ ```
66
+
67
+ ## RLHF Pipeline
68
+
69
+ Full pipeline: SL → Preference Model → RL, implemented in `recipes/preference/rlhf/rlhf_pipeline.py`.
70
+
71
+ ```bash
72
+ python -m recipes.preference.rlhf.rlhf_pipeline
73
+ ```
74
+
75
+ ### Step 1: Train Initial Policy (SL)
76
+
77
+ Train on instruction-following data (e.g., no_robots from HuggingFace) to match InstructGPT methodology.
78
+
79
+ ### Step 2: Train Preference Model (SL)
80
+
81
+ Train on pairwise comparison data (e.g., HHH dataset from Anthropic). Model sees completions A and B, outputs which is preferred.
82
+
83
+ ### Step 3: Train Policy via RL
84
+
85
+ Use the preference model as the reward signal. For each prompt (see the sketch after this list):
86
+ 1. Sample multiple completions from the policy
87
+ 2. Use preference model to grade all pairs
88
+ 3. Give reward based on win fraction (self-play)
89
+
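+ A minimal sketch of the step-3 reward (here `prefers` stands in for the trained preference model; it is a placeholder, not a cookbook API):
+
+ ```python
+ from typing import Callable
+
+ def win_fraction_rewards(
+     completions: list[str],
+     prefers: Callable[[str, str], bool],  # placeholder judge: True if the first argument wins
+ ) -> list[float]:
+     # Each completion is compared against every other one in its group;
+     # its reward is the fraction of pairwise comparisons it wins.
+     n = len(completions)
+     return [
+         sum(prefers(a, b) for j, b in enumerate(completions) if j != i) / (n - 1)
+         for i, a in enumerate(completions)
+     ]
+ ```
+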
90
+ ## Prompt Distillation
91
+
92
+ Train a model to behave as if given a long prompt, without needing that prompt at inference time.
93
+
94
+ ### How It Works
95
+
96
+ 1. **Teacher generates data**: Long, detailed teacher prompt + queries → responses
97
+ 2. **Student trains on data**: The student model fine-tunes on the (query, response) pairs without the teacher prompt (see the sketch below)
98
+
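+ A rough sketch of step 1's output (the message schema and `sample_fn` are assumptions for illustration, not the recipe's exact interface):
+
+ ```python
+ from typing import Callable
+
+ def make_distilled_example(
+     sample_fn: Callable[[list[dict]], str],  # hypothetical teacher sampling function
+     teacher_prompt: str,
+     query: str,
+ ) -> dict:
+     # The teacher answers with the long system prompt in context...
+     response = sample_fn([
+         {"role": "system", "content": teacher_prompt},
+         {"role": "user", "content": query},
+     ])
+     # ...but the saved pair omits it, so the student learns the behavior
+     # without needing the prompt at inference time.
+     return {"messages": [
+         {"role": "user", "content": query},
+         {"role": "assistant", "content": response},
+     ]}
+ ```
+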
99
+ ### Example: Language Classification
100
+
101
+ ```bash
102
+ # Step 1: Generate training data with teacher model
103
+ python -m tinker_cookbook.recipes.prompt_distillation.create_data \
104
+ output_file=/tmp/tinker-datasets/prompt_distillation_lang.jsonl
105
+
106
+ # Step 2: Train student model on distilled data
107
+ python -m tinker_cookbook.recipes.prompt_distillation.train
108
+
109
+ # Step 3: Test distilled model
110
+ # Sample from trained model to verify performance
111
+ ```
112
+
113
+ ### When to Use
114
+
115
+ - System prompt grows impractically long and model starts ignoring instructions
116
+ - Need fast inference without long context overhead
117
+ - Want to specialize a model for a narrow task distribution
118
+ - Teacher and student can be the same model (self-distillation)
119
+
120
+ ### Advanced Configuration
121
+
122
+ - **Teacher model selection**: Choose based on quality requirements
123
+ - **Sampling strategies**: Adjust temperature and generation parameters
124
+ - **Data volume**: Scale generated examples based on task complexity
125
+ - **Training hyperparameters**: Follow standard SL hyperparameter guidance
126
+
127
+ ## LR Sweep Methodology
128
+
129
+ For finding task-specific optimal LR:
130
+
131
+ ### Setup
132
+
133
+ ```python
134
+ from tinker_cookbook.hyperparam_utils import get_lr
135
+ default_lr = get_lr("meta-llama/Llama-3.1-8B") # ~2.8e-4
136
+ ```
137
+
138
+ ### Sweep Range
139
+
140
+ Sweep one order of magnitude above and below default:
141
+
142
+ ```bash
143
+ # Launch in parallel (separate terminals)
144
+ python -m tinker_cookbook.recipes.sl_loop learning_rate=0.003 log_path=/tmp/sweep/lr-0.003
145
+ python -m tinker_cookbook.recipes.sl_loop learning_rate=0.001 log_path=/tmp/sweep/lr-0.001
146
+ python -m tinker_cookbook.recipes.sl_loop learning_rate=0.0003 log_path=/tmp/sweep/lr-0.0003
147
+ python -m tinker_cookbook.recipes.sl_loop learning_rate=0.0001 log_path=/tmp/sweep/lr-0.0001
148
+ python -m tinker_cookbook.recipes.sl_loop learning_rate=0.00003 log_path=/tmp/sweep/lr-0.00003
149
+ python -m tinker_cookbook.recipes.sl_loop learning_rate=0.00001 log_path=/tmp/sweep/lr-0.00001
150
+ ```
151
+
152
+ ### Collect and Visualize
153
+
154
+ ```python
155
+ from glob import glob
156
+ import pandas, json
157
+
158
+ data = []
159
+ for fname in sorted(glob("/tmp/sweep/*/metrics.jsonl")):
160
+ df = pandas.read_json(fname, lines=True)
161
+ if len(df) == 0 or df["progress"].iloc[-1] < 0.98:
162
+ continue
163
+ config = json.load(open(fname.replace("metrics.jsonl", "config.json")))
164
+ data.append({
165
+ "learning_rate": config["learning_rate"],
166
+ "final_loss": df["train_mean_nll"].iloc[-1].item()
167
+ })
168
+
169
+ df = pandas.DataFrame(data)
170
+ optimal_lr = df["learning_rate"][df["final_loss"].idxmin()]
171
+ print(f"Optimal LR: {optimal_lr:.2e}")
172
+ ```
173
+
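+ To visualize the sweep, continuing from the `df` built above (matplotlib is an assumed extra dependency, not part of the listed requirements):
+
+ ```python
+ import matplotlib.pyplot as plt
+
+ plt.semilogx(df["learning_rate"], df["final_loss"], marker="o")
+ plt.xlabel("learning rate")
+ plt.ylabel("final train_mean_nll")
+ plt.show()
+ ```
+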
174
+ Expected result: U-shaped curve with optimal LR near the `get_lr()` default.
@@ -0,0 +1,183 @@
1
+ # Evaluations
2
+
3
+ ## Inline Evals (During Training)
4
+
5
+ Add evaluations that run periodically during training.
6
+
7
+ ### Supervised Fine-Tuning
8
+
9
+ ```python
10
+ blueprint = chz.Blueprint(train.Config).apply({
11
+ "model_name": model_name,
12
+ "dataset_builder": dataset_builder,
13
+ "evaluator_builders": [my_evaluator], # Runs every eval_every steps
14
+ "infrequent_evaluator_builders": [heavy_eval], # Runs every infrequent_eval_every steps
15
+ "eval_every": 8,
16
+ "infrequent_eval_every": 50,
17
+ })
18
+ ```
19
+
20
+ ### RL Training
21
+
22
+ ```python
23
+ blueprint = chz.Blueprint(train.Config).apply({
24
+ "model_name": model_name,
25
+ "dataset_builder": builder,
26
+ "evaluator_builders": [sampling_eval], # SamplingClientEvaluator instances
27
+ "eval_every": 5,
28
+ })
29
+ ```
30
+
31
+ ## Offline Evals with Inspect AI
32
+
33
+ Run standard evaluations on checkpoints using the Inspect AI library:
34
+
35
+ ```bash
36
+ MODEL_PATH=tinker://YOUR_MODEL_PATH_HERE
37
+ python -m tinker_cookbook.eval.run_inspect_evals \
38
+ model_path=$MODEL_PATH \
39
+ model_name=MODEL_NAME \
40
+ tasks=inspect_evals/ifeval,inspect_evals/mmlu_0_shot \
41
+ renderer_name=RENDERER_NAME
42
+ ```
43
+
44
+ ### Creating Custom Inspect AI Tasks
45
+
46
+ ```python
47
+ import tinker
48
+ from inspect_ai import Task, task
49
+ from inspect_ai.dataset import MemoryDataset, Sample
50
+ from inspect_ai.model import GenerateConfig as InspectAIGenerateConfig
51
+ from inspect_ai.model import Model as InspectAIModel
52
+ from inspect_ai.scorer import model_graded_qa
53
+ from inspect_ai.solver import generate
54
+ from tinker_cookbook.eval.inspect_utils import InspectAPIFromTinkerSampling
55
+
56
+ QA_DATASET = MemoryDataset(
57
+ name="qa_dataset",
58
+ samples=[
59
+ Sample(input="What is the capital of France?", target="Paris"),
60
+ Sample(input="What is the capital of Italy?", target="Rome"),
61
+ ],
62
+ )
63
+
64
+ service_client = tinker.ServiceClient()
65
+ sampling_client = service_client.create_sampling_client(
66
+ base_model="meta-llama/Llama-3.1-8B-Instruct"
67
+ )
68
+
69
+ api = InspectAPIFromTinkerSampling(
70
+ renderer_name="llama3",
71
+ model_name="meta-llama/Llama-3.1-8B-Instruct",
72
+ sampling_client=sampling_client,
73
+ verbose=False,
74
+ )
75
+
76
+ GRADER_MODEL = InspectAIModel(api=api, config=InspectAIGenerateConfig())
77
+
78
+ @task
79
+ def example_lm_as_judge() -> Task:
80
+ return Task(
81
+ name="llm_as_judge",
82
+ dataset=QA_DATASET,
83
+ solver=generate(),
84
+ scorer=model_graded_qa(
85
+ instructions="Grade strictly. Respond 'GRADE: C' if correct or 'GRADE: I' otherwise.",
86
+ partial_credit=False,
87
+ model=GRADER_MODEL,
88
+ ),
89
+ )
90
+ ```
91
+
92
+ Inspect also supports any OpenAI-compatible API (e.g., OpenRouter) as the grader model.
93
+
94
+ ## Custom SamplingClientEvaluator
95
+
96
+ Lower-level abstraction with fine-grained control:
97
+
98
+ ```python
99
+ from typing import Any, Callable
100
+ import tinker
101
+ from tinker import types
102
+ from tinker_cookbook import renderers
103
+ from tinker_cookbook.evaluators import SamplingClientEvaluator
104
+ from tinker_cookbook.tokenizer_utils import get_tokenizer
105
+
106
+ class CustomEvaluator(SamplingClientEvaluator):
107
+ def __init__(
108
+ self,
109
+ dataset: Any,
110
+ grader_fn: Callable[[str, str], bool],
111
+ model_name: str,
112
+ renderer_name: str,
113
+ ):
114
+ self.dataset = dataset
115
+ self.grader_fn = grader_fn
116
+ tokenizer = get_tokenizer(model_name)
117
+ self.renderer = renderers.get_renderer(name=renderer_name, tokenizer=tokenizer)
118
+
119
+ async def __call__(self, sampling_client: tinker.SamplingClient) -> dict[str, float]:
120
+ sampling_params = types.SamplingParams(
121
+ max_tokens=100,
122
+ temperature=0.7,
123
+ top_p=1.0,
124
+ stop=self.renderer.get_stop_sequences(),
125
+ )
126
+
127
+ num_correct = 0
128
+ for datum in self.dataset:
129
+ model_input = self.renderer.build_generation_prompt(
130
+ [renderers.Message(role="user", content=datum["input"])]
131
+ )
132
+ r = await sampling_client.sample_async(
133
+ prompt=model_input, num_samples=1, sampling_params=sampling_params
134
+ )
135
+ tokens = r.sequences[0].tokens
136
+ response = self.renderer.parse_response(tokens)[0]
137
+ if self.grader_fn(response["content"], datum["output"]):
138
+ num_correct += 1
139
+
140
+ return {"accuracy": num_correct / len(self.dataset)}
141
+ ```
142
+
143
+ ### Usage
144
+
145
+ ```python
146
+ import asyncio
147
+
148
+ QA_DATASET = [
149
+ {"input": "What is the capital of France?", "output": "Paris"},
150
+ {"input": "What is the capital of Germany?", "output": "Berlin"},
151
+ ]
152
+
153
+ def grader_fn(response: str, target: str) -> bool:
154
+ return target.lower() in response.lower()
155
+
156
+ evaluator = CustomEvaluator(
157
+ dataset=QA_DATASET,
158
+ grader_fn=grader_fn,
159
+ renderer_name="llama3",
160
+ model_name="meta-llama/Llama-3.1-8B-Instruct",
161
+ )
162
+
163
+ service_client = tinker.ServiceClient()
164
+ sampling_client = service_client.create_sampling_client(
165
+ base_model="meta-llama/Llama-3.1-8B-Instruct"
166
+ )
167
+
168
+ async def main():
169
+ result = await evaluator(sampling_client)
170
+ print(result) # {"accuracy": 1.0}
171
+
172
+ asyncio.run(main())
173
+ ```
174
+
175
+ ## Evaluation Strategy
176
+
177
+ | Stage | Method | When |
178
+ |-------|--------|------|
179
+ | During SFT | `evaluator_builders` | Every N training steps |
180
+ | During RL | `evaluator_builders` (SamplingClientEvaluator) | Every N iterations |
181
+ | After training | `run_inspect_evals` CLI | On final checkpoint |
182
+ | Custom tasks | Custom `SamplingClientEvaluator` | Any time with sampling client |
183
+ | LLM-as-judge | Inspect AI with `model_graded_qa` | When automated grading needed |
@@ -61,7 +61,7 @@ import numpy as np
61
61
 
62
62
  for _ in range(6):
63
63
  fwdbwd_future = training_client.forward_backward(processed_examples, "cross_entropy")
64
- optim_future = training_client.optim_step(types.AdamParams(learning_rate=1e-4))
64
+ optim_future = training_client.optim_step(types.AdamParams(learning_rate=2e-4)) # Use get_lr() for production
65
65
 
66
66
  fwdbwd_result = fwdbwd_future.result()
67
67
  optim_result = optim_future.result()
@@ -3,7 +3,7 @@
3
3
  ## Model Selection Guide
4
4
 
5
5
  - **Use MoE models** - More cost effective than dense
6
- - **Base models** - Only for research or full post-training
6
+ - **Base models** - For LoRA-based post-training research (no built-in chat format)
7
7
  - **Instruction models** - Fast inference, no chain-of-thought
8
8
  - **Hybrid/Reasoning models** - Long chain-of-thought for quality
9
9
 
@@ -36,7 +36,7 @@
36
36
  **Sizes:** Compact (1-4B), Small (8B), Medium (30-32B), Large (70B+)
37
37
 
38
38
  **Types:**
39
- - **Base**: Pretrained, for post-training research
39
+ - **Base**: Pretrained, no chat formatting — use for LoRA post-training research
40
40
  - **Instruction**: Chat-tuned, fast inference
41
41
  - **Hybrid**: Thinking + non-thinking modes
42
42
  - **Reasoning**: Always uses chain-of-thought
@@ -51,6 +51,12 @@ LoRA (Low-Rank Adaptation) fine-tunes small parameter subset instead of all weig
51
51
  - SL on small-medium instruction datasets: **Same as full fine-tuning**
52
52
  - RL: **Equivalent to full fine-tuning even with small ranks**
53
53
  - Large datasets: May underperform (increase rank)
54
+ - LoRA performs better when applied to **all weight matrices** (attention + MLP + MoE). Attention-only LoRA underperforms even with matched parameter counts
55
+
56
+ ### LoRA Limitations
57
+
58
+ - **Large batch sizes**: LoRA is less tolerant of large batch sizes than full FT — pays a larger loss penalty as batch size increases beyond some point. This penalty is NOT mitigated by increasing rank; it's a property of the product-of-matrices parametrization
59
+ - **Large SL datasets**: When the dataset exceeds LoRA capacity, training efficiency degrades gradually (there is no distinct capacity floor)
54
60
 
55
61
  ### LoRA Learning Rate
56
62
 
@@ -61,7 +67,10 @@ from tinker_cookbook.hyperparam_utils import get_lora_lr_over_full_finetune_lr
61
67
 
62
68
  model_name = "meta-llama/Llama-3.1-8B"
63
69
  factor = get_lora_lr_over_full_finetune_lr(model_name)
64
- # Returns 10.0 for all models (empirically validated)
70
+ # Factor varies by model size:
71
+ # Llama-3.2-1B → 32
72
+ # Llama-3.1-8B → ~50
73
+ # Llama-3.1-70B → 128
65
74
  ```
66
75
 
67
76
  ### Recommended Learning Rate
@@ -12,6 +12,7 @@ from tinker_cookbook.renderers import TrainOnWhat
12
12
  from tinker_cookbook.supervised import train
13
13
  from tinker_cookbook.supervised.data import FromConversationFileBuilder
14
14
  from tinker_cookbook.supervised.types import ChatDatasetBuilderCommonConfig
15
+ from tinker_cookbook.hyperparam_utils import get_lr
15
16
 
16
17
  def build_config_blueprint() -> chz.Blueprint[train.Config]:
17
18
  model_name = "meta-llama/Llama-3.1-8B"
@@ -35,7 +36,7 @@ def build_config_blueprint() -> chz.Blueprint[train.Config]:
35
36
  "log_path": "/tmp/tinker-examples/sl_basic",
36
37
  "model_name": model_name,
37
38
  "dataset_builder": dataset,
38
- "learning_rate": 2e-4,
39
+ "learning_rate": get_lr(model_name),
39
40
  "lr_schedule": "linear",
40
41
  "num_epochs": 1,
41
42
  "eval_every": 8,
@@ -61,13 +62,14 @@ from tinker_cookbook import checkpoint_utils, model_info, renderers
61
62
  from tinker_cookbook.supervised.common import compute_mean_nll
62
63
  from tinker_cookbook.supervised.data import conversation_to_datum
63
64
  from tinker_cookbook.tokenizer_utils import get_tokenizer
65
+ from tinker_cookbook.hyperparam_utils import get_lr
64
66
 
65
67
  @chz.chz
66
68
  class Config:
67
69
  log_path: str = "/tmp/tinker-examples/sl-loop"
68
70
  model_name: str = "meta-llama/Llama-3.1-8B"
69
71
  batch_size: int = 128
70
- learning_rate: float = 1e-4
72
+ learning_rate: float = get_lr("meta-llama/Llama-3.1-8B") # ~2.8e-4
71
73
  max_length: int = 32768
72
74
  train_on_what: renderers.TrainOnWhat = renderers.TrainOnWhat.ALL_ASSISTANT_MESSAGES
73
75
  lora_rank: int = 32
@@ -257,6 +259,41 @@ if __name__ == "__main__":
257
259
  chz.nested_entrypoint(main)
258
260
  ```
259
261
 
262
+ ## DPO / Preference Learning
263
+
264
+ ```bash
265
+ python -m tinker_cookbook.recipes.preference.train \
266
+ log_path=/tmp/dpo-experiment \
267
+ model_name=meta-llama/Llama-3.2-1B \
268
+ dataset=hhh \
269
+ renderer_name=role_colon \
270
+ learning_rate=1e-5 \
271
+ dpo_beta=0.1
272
+ ```
273
+
274
+ Available datasets: `hhh` (Anthropic), `helpsteer3` (NVIDIA), `ultrafeedback`
275
+
276
+ See [DPO & Preference Learning](dpo-and-preference.md) for full RLHF pipeline and parameters.
277
+
278
+ ## Prompt Distillation
279
+
280
+ ```bash
281
+ # Step 1: Generate training data with teacher model
282
+ python -m tinker_cookbook.recipes.prompt_distillation.create_data \
283
+ output_file=/tmp/tinker-datasets/prompt_distillation_lang.jsonl
284
+
285
+ # Step 2: Train student model on distilled data
286
+ python -m tinker_cookbook.recipes.prompt_distillation.train
287
+ ```
288
+
289
+ ## Multi-Step RL (Twenty Questions)
290
+
291
+ ```bash
292
+ python -m tinker_cookbook.recipes.twenty_questions.train
293
+ ```
294
+
295
+ Complete multi-step environment where a question-asking agent learns to guess hidden words. Good reference for building custom multi-turn RL environments.
296
+
260
297
  ## Running Recipes
261
298
 
262
299
  ```bash
@@ -271,6 +308,15 @@ python -m tinker_cookbook.recipes.rl_basic
271
308
 
272
309
  # Manual RL loop
273
310
  python -m tinker_cookbook.recipes.rl_loop
311
+
312
+ # DPO
313
+ python -m tinker_cookbook.recipes.preference.train
314
+
315
+ # Prompt distillation
316
+ python -m tinker_cookbook.recipes.prompt_distillation.train
317
+
318
+ # Multi-step RL
319
+ python -m tinker_cookbook.recipes.twenty_questions.train
274
320
  ```
275
321
 
276
322
  ## CLI Overrides
@@ -11,6 +11,8 @@ Fine-tunes Llama-3.1-8B on GSM8K with reward:
11
11
  1[answer correct] + 0.1 * (1[format correct] - 1)
12
12
  ```
13
13
 
14
+ Training takes ~1 min/iteration and reaches ~63% accuracy after 15 iterations.
15
+
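+ A minimal sketch of that reward shape (`extract_answer` and `follows_format` are placeholder helpers, not the recipe's actual functions):
+
+ ```python
+ from typing import Callable
+
+ def gsm8k_reward(
+     completion: str,
+     target: str,
+     extract_answer: Callable[[str], str],   # placeholder: pulls the final answer out of the completion
+     follows_format: Callable[[str], bool],  # placeholder: checks the required answer format
+ ) -> float:
+     # 1[answer correct] + 0.1 * (1[format correct] - 1):
+     # +1 for a correct answer, plus a -0.1 penalty when the format is wrong
+     answer_correct = extract_answer(completion) == target
+     format_correct = follows_format(completion)
+     return float(answer_correct) + 0.1 * (float(format_correct) - 1.0)
+ ```
+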
14
16
  ## Basic RL Config
15
17
 
16
18
  ```python
@@ -54,6 +56,93 @@ if __name__ == "__main__":
54
56
  - `entropy` - Per-token entropy
55
57
  - `kl_sample_train_v1/v2` - KL divergence (sampler vs learner)
56
58
 
59
+ ## RL Environment Classes
60
+
61
+ Custom RL environments use three classes from `tinker_cookbook.rl.types`:
62
+
63
+ ### Env
64
+
65
+ Stateful environment for a single agent episode. Discard after one episode.
66
+
67
+ ```python
68
+ from tinker_cookbook.rl.types import Env, Observation, StopCondition, Action, StepResult
69
+
70
+ class MyEnv(Env):
71
+ async def initial_observation(self) -> tuple[Observation, StopCondition]:
72
+ # Return initial tokens and stop condition
73
+ raise NotImplementedError
74
+
75
+ async def step(self, action: Action) -> StepResult:
76
+ # Process agent action, return next observation + reward
77
+ raise NotImplementedError
78
+ ```
79
+
80
+ Note: `Env` operates on tokens (not strings/messages) because the training code needs exact tokens and logprobs.
81
+
82
+ ### EnvGroupBuilder
83
+
84
+ Creates groups of environments (enables multi-agent training or paired comparisons):
85
+
86
+ ```python
87
+ from tinker_cookbook.rl.types import EnvGroupBuilder
88
+
89
+ class MyEnvGroupBuilder(EnvGroupBuilder):
90
+ async def make_envs(self) -> list[Env]:
91
+         return [MyEnv() for _ in range(self.group_size)]  # group_size set in __init__
92
+ ```
93
+
94
+ ### RLDataset
95
+
96
+ Dataset of `EnvGroupBuilder` objects:
97
+
98
+ ```python
99
+ from tinker_cookbook.rl.types import RLDataset
100
+
101
+ class MyDataset(RLDataset):
102
+ def get_batch(self, index: int) -> list[EnvGroupBuilder]:
103
+ return [MyEnvGroupBuilder(problem) for problem in self.problems[index]]
104
+ ```
105
+
106
+ ### Multi-Step Environment Example
107
+
108
+ See `tinker_cookbook.recipes.twenty_questions` for a complete multi-step environment where a question-asking agent learns to guess hidden words:
109
+
110
+ ```bash
111
+ python -m tinker_cookbook.recipes.twenty_questions.train
112
+ ```
113
+
114
+ ## Completers (Policy Abstractions)
115
+
116
+ Completers represent models/policies that can be sampled from.
117
+
118
+ ### TokenCompleter (for RL training)
119
+
120
+ ```python
121
+ from tinker import types
+ from tinker_cookbook.rl.types import StopCondition, TokensWithLogprobs
122
+
123
+ class TokenCompleter:  # interface defined in tinker_cookbook.rl.types
124
+ async def __call__(
125
+ self, model_input: types.ModelInput, stop: StopCondition
126
+ ) -> TokensWithLogprobs:
127
+ ...
128
+ ```
129
+
130
+ Used by RL algorithms because they work directly with tokens.
131
+
132
+ ### MessageCompleter (for inference/judging)
133
+
134
+ ```python
135
+ from tinker_cookbook import renderers
136
+
137
+ class MessageCompleter:  # interface defined in tinker_cookbook.rl.types
138
+ async def __call__(self, messages: list[renderers.Message]) -> renderers.Message:
139
+ ...
140
+ ```
141
+
142
+ Operates at message level — useful for judge models, multi-agent environments, and evaluation. Requires a renderer to convert messages to tokens for the sampling client.
143
+
144
+ Concrete implementations: `TinkerTokenCompleter` and `TinkerMessageCompleter` (wrappers around `tinker.SamplingClient`).
145
+
57
146
  ## Custom RL Loop
58
147
 
59
148
  ```python
@@ -143,12 +232,18 @@ def main(config: Config):
143
232
 
144
233
  ## Hyperparameters
145
234
 
235
+ ### Learning Rate
236
+
237
+ Same guidance as SL — use `get_lr(model_name)` as a starting point. The rl_basic recipe uses `4e-5` for Llama-3.1-8B.
238
+
146
239
  ### Batch and Group Sizes
147
240
 
148
- - `batch_size`: Number of unique problems
241
+ - `batch_size`: Number of unique problems per iteration
149
242
  - `group_size`: Rollouts per problem (for variance reduction)
150
243
 
151
- Scale: `LR ∝ √batch_size`
244
+ Scale LR proportionally: `LR ∝ √batch_size`
245
+
246
+ If you have limited problems, increase `group_size` to generate more training data.
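+
+ A quick illustration of the square-root scaling rule above:
+
+ ```python
+ import math
+
+ def scale_lr(base_lr: float, base_batch: int, new_batch: int) -> float:
+     # LR ∝ sqrt(batch_size): doubling the batch multiplies the LR by ~1.41
+     return base_lr * math.sqrt(new_batch / base_batch)
+
+ scale_lr(4e-5, 128, 256)  # ≈ 5.7e-5
+ ```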
152
247
 
153
248
  ### Multiple Updates (num_substeps)
154
249
 
@@ -157,14 +252,14 @@ Scale: `LR ∝ √batch_size`
157
252
  num_substeps = 1
158
253
 
159
254
  # Multiple updates: split batch into mini-batches
160
- num_substeps = 4 # Batch must be divisible
255
+ num_substeps = 4 # Batch must be divisible by num_substeps
161
256
  ```
162
257
 
163
- Use with PPO objective. Start with 2-4.
258
+ Use with PPO objective. Start with 2-4. Higher values risk out-of-distribution updates.
164
259
 
165
260
  ### Streaming Minibatch Training
166
261
 
167
- Overlaps sampling and training for throughput:
262
+ Overlaps sampling and training for throughput (on-policy, pipeline efficiency only):
168
263
 
169
264
  ```python
170
265
  StreamMinibatchConfig(
@@ -175,22 +270,72 @@ StreamMinibatchConfig(
175
270
 
176
271
  ### Async Off-Policy Training
177
272
 
178
- For long rollouts:
273
+ For long rollouts (CoT, tool use, agentic workflows):
179
274
 
180
275
  ```python
181
276
  AsyncConfig(
182
- max_steps_off_policy=3, # Max age of trajectories
277
+ max_steps_off_policy=3, # Max age of trajectories before discard
183
278
  groups_per_batch=64,
184
279
  )
185
280
  ```
186
281
 
282
+ Start with `max_steps_off_policy < 5`. Monitor KL divergence carefully.
283
+
284
+ ## Sequence Extension Property
285
+
286
+ Critical for multi-turn RL efficiency. When consecutive timesteps satisfy the extension property (each observation is a prefix extension of the previous), compute scales O(T) instead of O(T²).
287
+
288
+ ### When Extension Holds
289
+
290
+ With `Qwen3Renderer(strip_thinking_from_history=False)`, thinking blocks are preserved:
291
+
292
+ ```
293
+ Timestep 1: [User: Q1] [A: <think>...</think> A1] [User:]
294
+ Timestep 2: [User: Q1] [A: <think>...</think> A1] [User: Q2] [A: <think>...</think> A2] [User:]
295
+ ```
296
+
297
+ Timestep 2 contains timestep 1 as a prefix → single Datum, KV-cache reuse.
298
+
299
+ ### When Extension Breaks
300
+
301
+ With default `strip_thinking_from_history=True`, `<think>` blocks are stripped from history:
302
+
303
+ ```
304
+ Timestep 1: [User: Q1] [A: <think>...</think> A1] [User:]
305
+ Timestep 2: [User: Q1] [A: A1] [User: Q2] [A: <think>...</think> A2] [User:]
306
+ ```
307
+
308
+ Prefix doesn't match → separate Datums per timestep, O(T²) compute.
309
+
310
+ ### Check Programmatically
311
+
312
+ ```python
313
+ renderer.has_extension_property # True or False
314
+ ```
315
+
316
+ For `Qwen3Renderer`:
317
+ - `strip_thinking_from_history=False` → `has_extension_property=True`
318
+ - `strip_thinking_from_history=True` (default) → `has_extension_property=False`
319
+
320
+ ### Periodic Compaction (Hybrid)
321
+
322
+ Keep thinking visible most of the time, periodically strip old thinking blocks:
323
+
324
+ - Turns 1-N: keep thinking visible (extension holds, single datum)
325
+ - Turn N+1: strip thinking from turns 1-N (extension breaks once)
326
+ - Turns N+1 to 2N: keep thinking visible again
327
+ - Repeat every N turns
328
+
329
+ This amortizes recomputation cost over N turns with bounded context growth.
330
+
187
331
  ## Monitoring
188
332
 
189
333
  ### KL Divergence
190
334
 
191
335
  Monitor `kl_sample_train_v1/v2`:
192
336
  - Should stay below 0.01 for stable training
193
- - High KL indicates numerical instability
337
+ - Even on-policy training shows non-zero KL (implementation detail)
338
+ - KL crossing this threshold indicates numerical instability
194
339
 
195
340
  ### Reward Curves
196
341
 
@@ -49,10 +49,22 @@ model_input, weights = renderer.build_supervised_example(
49
49
  messages,
50
50
  train_on_what=TrainOnWhat.ALL_ASSISTANT_MESSAGES,
51
51
  )
52
- # model_input: ModelInput (list of chunks)
52
+ # model_input: ModelInput (token sequence)
53
53
  # weights: per-token weights (0.0 = prompt, 1.0 = train)
54
54
  ```
55
55
 
56
+ **Default behavior:** If `train_on_what` is omitted, defaults to `LAST_ASSISTANT_MESSAGE`. Always specify explicitly to avoid surprises.
57
+
58
+ ```python
59
+ # WRONG — silently trains only on last assistant message
60
+ model_input, weights = renderer.build_supervised_example(messages)
61
+
62
+ # RIGHT — explicitly train on all assistant messages
63
+ model_input, weights = renderer.build_supervised_example(
64
+ messages, train_on_what=TrainOnWhat.ALL_ASSISTANT_MESSAGES
65
+ )
66
+ ```
67
+
56
68
  ### build_generation_prompt
57
69
 
58
70
  For inference:
@@ -17,6 +17,7 @@ from tinker_cookbook.supervised.types import ChatDatasetBuilderCommonConfig
17
17
  from tinker_cookbook.supervised.data import FromConversationFileBuilder
18
18
  from tinker_cookbook.renderers import TrainOnWhat
19
19
  from tinker_cookbook.model_info import get_recommended_renderer_name
20
+ from tinker_cookbook.hyperparam_utils import get_lr
20
21
 
21
22
  def build_config_blueprint() -> chz.Blueprint[train.Config]:
22
23
  model_name = "meta-llama/Llama-3.1-8B"
@@ -39,8 +40,8 @@ def build_config_blueprint() -> chz.Blueprint[train.Config]:
39
40
  "log_path": "/tmp/training",
40
41
  "model_name": model_name,
41
42
  "dataset_builder": dataset_builder,
42
- "learning_rate": 2e-4,
43
- "lr_schedule": "cosine",
43
+ "learning_rate": get_lr(model_name),
44
+ "lr_schedule": "linear",
44
45
  "num_epochs": 3,
45
46
  "lora_rank": 32,
46
47
  })
@@ -182,19 +183,23 @@ class CustomDataset(SupervisedDataset):
182
183
  def __iter__(self):
183
184
  for item in self.data:
184
185
  messages = self._preprocess(item)
185
- example = self.renderer.build_supervised_example(
186
+ # build_supervised_example returns (model_input, weights) tuple
187
+ model_input, weights = self.renderer.build_supervised_example(
186
188
  messages=messages,
187
189
  train_on_what=TrainOnWhat.ALL_ASSISTANT_MESSAGES,
188
190
  )
191
+ tokens = model_input.to_ints()
189
192
  yield Datum(
190
- model_input=ModelInput([example.chunk]),
193
+ model_input=ModelInput.from_ints(tokens=tokens[:-1]),
191
194
  loss_fn_inputs={
192
- "target_tokens": TensorData.from_numpy(np.array(example.target_tokens, dtype=np.int64)),
193
- "weights": TensorData.from_numpy(np.array(example.weights, dtype=np.float32)),
195
+ "target_tokens": TensorData.from_numpy(np.array(tokens[1:], dtype=np.int64)),
196
+ "weights": TensorData.from_numpy(np.array(weights[1:], dtype=np.float32)),
194
197
  },
195
198
  )
196
199
  ```
197
200
 
201
+ **Note:** `build_supervised_example` returns a tuple `(model_input, weights)`. The `model_input` is a `ModelInput` (token sequence), and `weights` is a list of per-token floats (0.0 for prompt tokens, 1.0 for completion tokens). When constructing `Datum` manually, shift by one position: input is `tokens[:-1]`, targets are `tokens[1:]`, weights are `weights[1:]`.
202
+
198
203
  ## Hyperparameters
199
204
 
200
205
  ### Learning Rate
@@ -215,7 +220,26 @@ Formula: `LR = lr_base * M_LoRA * (2000/H_m)^P_m`
215
220
 
216
221
  - Smaller batch sizes (128) generally better for fine-tuning
217
222
  - Scale LR with `LR ∝ √batch_size`
218
- - Aim for at least 100 steps of training
223
+ - Aim for at least 100 steps of training (1000+ steps for best results)
224
+
225
+ ## Checkpoints
226
+
227
+ Two types of saved weights:
228
+
229
+ | Method | Path Contains | Use Case |
230
+ |--------|--------------|----------|
231
+ | `save_weights_for_sampler()` | `/sampler_weights/` | Sampling/inference only (lightweight) |
232
+ | `save_state()` | `/weights/` | Full optimizer state for resuming training |
233
+
234
+ ```python
235
+ # Save for sampling
236
+ path = training_client.save_weights_for_sampler(name="final").result().path
237
+ sampling_client = service_client.create_sampling_client(model_path=path)
238
+
239
+ # Save full state for resuming
240
+ state_path = training_client.save_state(name="checkpoint").result().path
241
+ training_client.load_state(state_path) # Resume later
242
+ ```
219
243
 
220
244
  ## Output Files
221
245
 
package/bin/synsc CHANGED
Binary file
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "@synsci/cli-linux-x64-musl",
3
- "version": "1.1.50",
3
+ "version": "1.1.52",
4
4
  "os": [
5
5
  "linux"
6
6
  ],