@synsci/cli-linux-x64-musl 1.1.50 → 1.1.52
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/bin/skills/tinker/SKILL.md +64 -6
- package/bin/skills/tinker/references/dpo-and-preference.md +174 -0
- package/bin/skills/tinker/references/evaluations.md +183 -0
- package/bin/skills/tinker/references/getting-started.md +1 -1
- package/bin/skills/tinker/references/models-and-lora.md +12 -3
- package/bin/skills/tinker/references/recipes.md +48 -2
- package/bin/skills/tinker/references/reinforcement-learning.md +153 -8
- package/bin/skills/tinker/references/rendering.md +13 -1
- package/bin/skills/tinker/references/supervised-learning.md +31 -7
- package/bin/synsc +0 -0
- package/package.json +1 -1
package/bin/skills/tinker/SKILL.md

@@ -4,7 +4,7 @@ description: Provides guidance for fine-tuning LLMs using the Tinker cloud train
 version: 1.0.0
 author: Synthetic Sciences
 license: MIT
-tags: [Fine-Tuning, Tinker, LoRA, Reinforcement Learning, Supervised Learning, Cloud Training, Vision-Language Models]
+tags: [Fine-Tuning, Tinker, LoRA, Reinforcement Learning, Supervised Learning, DPO, RLHF, Cloud Training, Vision-Language Models]
 dependencies: [tinker, tinker-cookbook, chz, transformers>=4.40.0, datasets, numpy]
 ---
 
@@ -44,10 +44,12 @@ Expert guidance for fine-tuning large language models using Tinker's managed clo
 | Setup & Core Concepts | [Getting Started](references/getting-started.md) |
 | API Classes & Types | [API Reference](references/api-reference.md) |
 | Supervised Learning | [Supervised Learning](references/supervised-learning.md) |
-| RL Training | [Reinforcement Learning](references/reinforcement-learning.md) |
+| RL Training & Environments | [Reinforcement Learning](references/reinforcement-learning.md) |
+| DPO, RLHF & Distillation | [DPO & Preference Learning](references/dpo-and-preference.md) |
 | Loss Functions | [Loss Functions](references/loss-functions.md) |
 | Chat Templates | [Rendering](references/rendering.md) |
 | Models & LoRA | [Models & LoRA](references/models-and-lora.md) |
+| Evaluations | [Evaluations](references/evaluations.md) |
 | Example Scripts | [Recipes](references/recipes.md) |
 
 ## Installation
@@ -81,6 +83,7 @@ from tinker_cookbook.supervised.types import ChatDatasetBuilderCommonConfig
 from tinker_cookbook.supervised.data import FromConversationFileBuilder
 from tinker_cookbook.renderers import TrainOnWhat
 from tinker_cookbook.model_info import get_recommended_renderer_name
+from tinker_cookbook.hyperparam_utils import get_lr
 
 model_name = "Qwen/Qwen3-30B-A3B"
 renderer_name = get_recommended_renderer_name(model_name)
@@ -102,8 +105,8 @@ blueprint = chz.Blueprint(train.Config).apply({
     "log_path": "/tmp/sft-run",
     "model_name": model_name,
     "dataset_builder": dataset_builder,
-    "learning_rate":
-    "lr_schedule": "
+    "learning_rate": get_lr(model_name),
+    "lr_schedule": "linear",
     "num_epochs": 3,
     "lora_rank": 32,
 })
@@ -274,14 +277,66 @@ training_client = service_client.create_lora_training_client(
 
 | Parameter | SFT Default | RL Default | Notes |
 |-----------|-------------|------------|-------|
-| `learning_rate` |
+| `learning_rate` | `get_lr(model)` | 4e-5 | Model-dependent; ~5e-4 for Qwen3-30B, ~2.8e-4 for Llama-8B |
 | `batch_size` | 128 | 128 | Smaller generally better for fine-tuning |
 | `lora_rank` | 32 | 32 | Higher rank = more capacity |
 | `group_size` | N/A | 16 | Rollouts per problem for RL |
 | `max_length` | 2048-32768 | N/A | Sequence length for SFT |
 | `max_tokens` | N/A | 256 | Max generation length for RL |
 | `num_epochs` | 1-3 | N/A | Training passes |
-| `lr_schedule` |
+| `lr_schedule` | linear | N/A | Only `linear` and `constant` supported |
+
+## Workflow 3: DPO (Preference Learning)
+
+Use this for aligning models with human preferences without a separate reward model.
+
+### Quick Start
+
+```bash
+python -m tinker_cookbook.recipes.preference.train \
+    log_path=/tmp/dpo-experiment \
+    model_name=meta-llama/Llama-3.2-1B \
+    dataset=hhh \
+    renderer_name=role_colon \
+    learning_rate=1e-5 \
+    dpo_beta=0.1
+```
+
+**Key differences from SFT**: Use lower LR (1e-5 to 1e-6), base model should be in-distribution with preference data.
+
+**Available datasets**: `hhh` (Anthropic), `helpsteer3` (NVIDIA), `ultrafeedback`
+
+**Full RLHF pipeline**: See [DPO & Preference Learning](references/dpo-and-preference.md) for the three-step SL → preference model → RL pipeline.
+
+---
+
+## Evaluations
+
+### Inline (During Training)
+
+Add `evaluator_builders` to config for periodic evaluation:
+
+```python
+blueprint = chz.Blueprint(train.Config).apply({
+    ...
+    "evaluator_builders": [my_evaluator],
+    "eval_every": 8,
+})
+```
+
+### Offline (After Training)
+
+```bash
+MODEL_PATH=tinker://YOUR_MODEL_PATH_HERE
+python -m tinker_cookbook.eval.run_inspect_evals \
+    model_path=$MODEL_PATH \
+    model_name=MODEL_NAME \
+    tasks=inspect_evals/ifeval,inspect_evals/mmlu_0_shot
+```
+
+See [Evaluations](references/evaluations.md) for custom evaluators and LLM-as-judge.
+
+---
 
 ## Cost Estimation & Usage Tracking
 
@@ -322,6 +377,9 @@ The CLI tracks all Tinker usage and reports it to the Synthetic Sciences dashboa
 | Reward stuck at 0 | Debug reward function independently, check answer extraction |
 | All advantages = 0 | Increase group size, ensure reward variance across completions |
 | Wrong tokenizer | Use model-specific tokenizer (see Models & LoRA reference) |
+| `Unknown learning rate schedule` | Only `"linear"` and `"constant"` are supported; `"cosine"` does NOT work |
+| Python 3.14 pydantic errors | Tinker requires Python 3.10-3.13; pydantic v1 is incompatible with 3.14+ |
+| Only 1 step per epoch | `batch_size` too large for dataset size; aim for 100+ steps per epoch |
 
 ## Saving and Resuming
 
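A quick sanity check for the "Only 1 step per epoch" troubleshooting row above. This is illustrative arithmetic only, not a Tinker API, and the dataset size is made up:

```python
# Steps per epoch is roughly dataset_size // batch_size, so a small dataset
# with the default batch_size=128 trains in very few optimizer steps.
dataset_size = 2_000
batch_size = 128
print(dataset_size // batch_size)   # 15 steps per epoch, well under the 100+ target

# Shrinking the batch (or adding data) raises the step count:
smaller_batch = 16
print(dataset_size // smaller_batch)  # 125 steps per epoch
```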
package/bin/skills/tinker/references/dpo-and-preference.md

@@ -0,0 +1,174 @@
+# DPO & Preference Learning
+
+## Direct Preference Optimization (DPO)
+
+DPO trains models to prefer chosen responses over rejected ones using a classification loss, without needing a separate reward model.
+
+### Quick Start
+
+```bash
+python -m tinker_cookbook.recipes.preference.train \
+    log_path=/tmp/dpo-experiment \
+    model_name=meta-llama/Llama-3.2-1B \
+    dataset=hhh \
+    renderer_name=role_colon \
+    learning_rate=1e-5 \
+    dpo_beta=0.1
+```
+
+### Key Parameters
+
+| Parameter | Description | Recommended |
+|-----------|-------------|-------------|
+| `model_name` | Base model (also used as reference policy) | Start with 1B-8B |
+| `dataset` | Preference dataset name | `hhh`, `helpsteer3`, `ultrafeedback` |
+| `renderer_name` | Chat format renderer | Match model family |
+| `learning_rate` | LR for optimization | 1e-5 to 1e-6 (lower than SFT) |
+| `dpo_beta` | Preference strength | Start with 0.1 |
+| `log_path` | Output directory | Required |
+
+### Available Datasets
+
+| Dataset | Source | Description |
+|---------|--------|-------------|
+| `hhh` | Anthropic | Helpful-Harmless-Honest pairwise comparisons |
+| `helpsteer3` | NVIDIA | HelpSteer3 preference dataset |
+| `ultrafeedback` | UltraFeedback | Binarized preferences |
+
+Custom datasets: implement `DPODatasetBuilder` from `tinker_cookbook.preference.preference_datasets`.
+
+### Training Metrics
+
+| Metric | Description | Watch For |
+|--------|-------------|-----------|
+| `dpo_loss` | DPO classification loss | Should decrease |
+| `accuracy` | Implicit reward model accuracy | Should increase |
+| `margin` | Chosen - rejected reward gap | Should increase |
+| `chosen_reward` | Average reward for chosen responses | Higher is better |
+| `rejected_reward` | Average reward for rejected responses | Lower is better |
+
+### Tips
+
+- **Beta parameter**: Start with `dpo_beta=0.1`, adjust based on dataset
+- **Learning rate**: Use lower LR than SFT (1e-5 to 1e-6)
+- **Base model**: Should already be in-distribution with the preference data. Either start with a light SFT phase or collect on-policy preferences. Sharp distribution mismatch creates strange behaviors
+- **Evaluation**: Use Inspect AI to evaluate after training (see [Evaluations](evaluations.md))
+
+### Evaluating DPO Models
+
+```bash
+MODEL_PATH=tinker://YOUR_MODEL_PATH_HERE
+python -m tinker_cookbook.eval.run_inspect_evals \
+    model_path=$MODEL_PATH \
+    model_name=meta-llama/Llama-3.2-1B \
+    tasks=inspect_evals/ifeval,inspect_evals/mmlu_0_shot
+```
+
+## RLHF Pipeline
+
+Full pipeline: SL → Preference Model → RL, implemented in `recipes/preference/rlhf/rlhf_pipeline.py`.
+
+```bash
+python -m recipes.preference.rlhf.rlhf_pipeline
+```
+
+### Step 1: Train Initial Policy (SL)
+
+Train on instruction-following data (e.g., no_robots from HuggingFace) to match InstructGPT methodology.
+
+### Step 2: Train Preference Model (SL)
+
+Train on pairwise comparison data (e.g., HHH dataset from Anthropic). Model sees completions A and B, outputs which is preferred.
+
+### Step 3: Train Policy via RL
+
+Use preference model as reward signal. For each prompt:
+1. Sample multiple completions from the policy
+2. Use preference model to grade all pairs
+3. Give reward based on win fraction (self-play)
+
+## Prompt Distillation
+
+Train a model to behave as if given a long prompt, without needing that prompt at inference time.
+
+### How It Works
+
+1. **Teacher generates data**: Long, detailed teacher prompt + queries → responses
+2. **Student trains on data**: Student model fine-tunes on (query, response) pairs without the teacher prompt
+
+### Example: Language Classification
+
+```bash
+# Step 1: Generate training data with teacher model
+python -m tinker_cookbook.recipes.prompt_distillation.create_data \
+    output_file=/tmp/tinker-datasets/prompt_distillation_lang.jsonl
+
+# Step 2: Train student model on distilled data
+python -m tinker_cookbook.recipes.prompt_distillation.train
+
+# Step 3: Test distilled model
+# Sample from trained model to verify performance
+```
+
+### When to Use
+
+- System prompt grows impractically long and model starts ignoring instructions
+- Need fast inference without long context overhead
+- Want to specialize a model for a narrow task distribution
+- Teacher and student can be the same model (self-distillation)
+
+### Advanced Configuration
+
+- **Teacher model selection**: Choose based on quality requirements
+- **Sampling strategies**: Adjust temperature and generation parameters
+- **Data volume**: Scale generated examples based on task complexity
+- **Training hyperparameters**: Follow standard SL hyperparameter guidance
+
+## LR Sweep Methodology
+
+For finding task-specific optimal LR:
+
+### Setup
+
+```python
+from tinker_cookbook.hyperparam_utils import get_lr
+default_lr = get_lr("meta-llama/Llama-3.1-8B")  # ~2.8e-4
+```
+
+### Sweep Range
+
+Sweep one order of magnitude above and below default:
+
+```bash
+# Launch in parallel (separate terminals)
+python -m tinker_cookbook.recipes.sl_loop learning_rate=0.003 log_path=/tmp/sweep/lr-0.003
+python -m tinker_cookbook.recipes.sl_loop learning_rate=0.001 log_path=/tmp/sweep/lr-0.001
+python -m tinker_cookbook.recipes.sl_loop learning_rate=0.0003 log_path=/tmp/sweep/lr-0.0003
+python -m tinker_cookbook.recipes.sl_loop learning_rate=0.0001 log_path=/tmp/sweep/lr-0.0001
+python -m tinker_cookbook.recipes.sl_loop learning_rate=0.00003 log_path=/tmp/sweep/lr-0.00003
+python -m tinker_cookbook.recipes.sl_loop learning_rate=0.00001 log_path=/tmp/sweep/lr-0.00001
+```
+
+### Collect and Visualize
+
+```python
+from glob import glob
+import pandas, json, os
+
+data = []
+for fname in sorted(glob("/tmp/sweep/*/metrics.jsonl")):
+    df = pandas.read_json(fname, lines=True)
+    if len(df) == 0 or df["progress"].iloc[-1] < 0.98:
+        continue
+    config = json.load(open(fname.replace("metrics.jsonl", "config.json")))
+    data.append({
+        "learning_rate": config["learning_rate"],
+        "final_loss": df["train_mean_nll"].iloc[-1].item()
+    })
+
+df = pandas.DataFrame(data)
+optimal_lr = df["learning_rate"][df["final_loss"].idxmin()]
+print(f"Optimal LR: {optimal_lr:.2e}")
+```
+
+Expected result: U-shaped curve with optimal LR near the `get_lr()` default.
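To make the DPO metrics table above concrete, here is a schematic NumPy sketch of the standard DPO quantities (implicit rewards, margin, loss, accuracy). It is not the tinker-cookbook implementation; `beta` and the per-example log-probability arrays are hypothetical stand-ins:

```python
import numpy as np

# Hypothetical summed log-probs of each response under the policy and the frozen reference.
beta = 0.1
logp_chosen,  logp_rejected = np.array([-12.0, -9.5]), np.array([-14.0, -9.0])
ref_chosen,   ref_rejected  = np.array([-13.0, -9.8]), np.array([-13.5, -9.1])

chosen_reward   = beta * (logp_chosen - ref_chosen)       # implicit reward of chosen responses
rejected_reward = beta * (logp_rejected - ref_rejected)   # implicit reward of rejected responses
margin   = chosen_reward - rejected_reward                 # should grow during training
dpo_loss = np.log(1 + np.exp(-margin))                     # -log(sigmoid(margin)), should shrink
accuracy = (margin > 0).mean()                             # implicit reward model accuracy
print(dpo_loss.mean(), accuracy)
```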
package/bin/skills/tinker/references/evaluations.md

@@ -0,0 +1,183 @@
+# Evaluations
+
+## Inline Evals (During Training)
+
+Add evaluations that run periodically during training.
+
+### Supervised Fine-Tuning
+
+```python
+blueprint = chz.Blueprint(train.Config).apply({
+    "model_name": model_name,
+    "dataset_builder": dataset_builder,
+    "evaluator_builders": [my_evaluator],  # Runs every eval_every steps
+    "infrequent_evaluator_builders": [heavy_eval],  # Runs every infrequent_eval_every steps
+    "eval_every": 8,
+    "infrequent_eval_every": 50,
+})
+```
+
+### RL Training
+
+```python
+blueprint = chz.Blueprint(train.Config).apply({
+    "model_name": model_name,
+    "dataset_builder": builder,
+    "evaluator_builders": [sampling_eval],  # SamplingClientEvaluator instances
+    "eval_every": 5,
+})
+```
+
+## Offline Evals with Inspect AI
+
+Run standard evaluations on checkpoints using the Inspect AI library:
+
+```bash
+MODEL_PATH=tinker://YOUR_MODEL_PATH_HERE
+python -m tinker_cookbook.eval.run_inspect_evals \
+    model_path=$MODEL_PATH \
+    model_name=MODEL_NAME \
+    tasks=inspect_evals/ifeval,inspect_evals/mmlu_0_shot \
+    renderer_name=RENDERER_NAME
+```
+
+### Creating Custom Inspect AI Tasks
+
+```python
+import tinker
+from inspect_ai import Task, task
+from inspect_ai.dataset import MemoryDataset, Sample
+from inspect_ai.model import GenerateConfig as InspectAIGenerateConfig
+from inspect_ai.model import Model as InspectAIModel
+from inspect_ai.scorer import model_graded_qa
+from inspect_ai.solver import generate
+from tinker_cookbook.eval.inspect_utils import InspectAPIFromTinkerSampling
+
+QA_DATASET = MemoryDataset(
+    name="qa_dataset",
+    samples=[
+        Sample(input="What is the capital of France?", target="Paris"),
+        Sample(input="What is the capital of Italy?", target="Rome"),
+    ],
+)
+
+service_client = tinker.ServiceClient()
+sampling_client = service_client.create_sampling_client(
+    base_model="meta-llama/Llama-3.1-8B-Instruct"
+)
+
+api = InspectAPIFromTinkerSampling(
+    renderer_name="llama3",
+    model_name="meta-llama/Llama-3.1-8B-Instruct",
+    sampling_client=sampling_client,
+    verbose=False,
+)
+
+GRADER_MODEL = InspectAIModel(api=api, config=InspectAIGenerateConfig())
+
+@task
+def example_lm_as_judge() -> Task:
+    return Task(
+        name="llm_as_judge",
+        dataset=QA_DATASET,
+        solver=generate(),
+        scorer=model_graded_qa(
+            instructions="Grade strictly. Respond 'GRADE: C' if correct or 'GRADE: I' otherwise.",
+            partial_credit=False,
+            model=GRADER_MODEL,
+        ),
+    )
+```
+
+Inspect also supports any OpenAI-compatible API (e.g., openrouter) as the grader model.
+
+## Custom SamplingClientEvaluator
+
+Lower-level abstraction with fine-grained control:
+
+```python
+from typing import Any, Callable
+import tinker
+from tinker import types
+from tinker_cookbook import renderers
+from tinker_cookbook.evaluators import SamplingClientEvaluator
+from tinker_cookbook.tokenizer_utils import get_tokenizer
+
+class CustomEvaluator(SamplingClientEvaluator):
+    def __init__(
+        self,
+        dataset: Any,
+        grader_fn: Callable[[str, str], bool],
+        model_name: str,
+        renderer_name: str,
+    ):
+        self.dataset = dataset
+        self.grader_fn = grader_fn
+        tokenizer = get_tokenizer(model_name)
+        self.renderer = renderers.get_renderer(name=renderer_name, tokenizer=tokenizer)
+
+    async def __call__(self, sampling_client: tinker.SamplingClient) -> dict[str, float]:
+        sampling_params = types.SamplingParams(
+            max_tokens=100,
+            temperature=0.7,
+            top_p=1.0,
+            stop=self.renderer.get_stop_sequences(),
+        )
+
+        num_correct = 0
+        for datum in self.dataset:
+            model_input = self.renderer.build_generation_prompt(
+                [renderers.Message(role="user", content=datum["input"])]
+            )
+            r = await sampling_client.sample_async(
+                prompt=model_input, num_samples=1, sampling_params=sampling_params
+            )
+            tokens = r.sequences[0].tokens
+            response = self.renderer.parse_response(tokens)[0]
+            if self.grader_fn(response["content"], datum["output"]):
+                num_correct += 1
+
+        return {"accuracy": num_correct / len(self.dataset)}
+```
+
+### Usage
+
+```python
+import asyncio
+
+QA_DATASET = [
+    {"input": "What is the capital of France?", "output": "Paris"},
+    {"input": "What is the capital of Germany?", "output": "Berlin"},
+]
+
+def grader_fn(response: str, target: str) -> bool:
+    return target.lower() in response.lower()
+
+evaluator = CustomEvaluator(
+    dataset=QA_DATASET,
+    grader_fn=grader_fn,
+    renderer_name="llama3",
+    model_name="meta-llama/Llama-3.1-8B-Instruct",
+)
+
+service_client = tinker.ServiceClient()
+sampling_client = service_client.create_sampling_client(
+    base_model="meta-llama/Llama-3.1-8B-Instruct"
+)
+
+async def main():
+    result = await evaluator(sampling_client)
+    print(result)  # {"accuracy": 1.0}
+
+asyncio.run(main())
+```
+
+## Evaluation Strategy
+
+| Stage | Method | When |
+|-------|--------|------|
+| During SFT | `evaluator_builders` | Every N training steps |
+| During RL | `evaluator_builders` (SamplingClientEvaluator) | Every N iterations |
+| After training | `run_inspect_evals` CLI | On final checkpoint |
+| Custom tasks | Custom `SamplingClientEvaluator` | Any time with sampling client |
+| LLM-as-judge | Inspect AI with `model_graded_qa` | When automated grading needed |
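If the substring check in the usage example above is too lenient for a task, a stricter `grader_fn` can be swapped in. The sketch below is an assumption-level example (the normalization scheme is not from the cookbook); it keeps the `(response, target) -> bool` signature that `CustomEvaluator` expects:

```python
import string

def normalized_exact_match(response: str, target: str) -> bool:
    """Stricter alternative to substring matching: compare response and target
    after lowercasing and stripping punctuation and surrounding whitespace."""
    def norm(s: str) -> str:
        return s.lower().translate(str.maketrans("", "", string.punctuation)).strip()
    return norm(response) == norm(target)

# evaluator = CustomEvaluator(dataset=QA_DATASET, grader_fn=normalized_exact_match, ...)
```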
package/bin/skills/tinker/references/getting-started.md

@@ -61,7 +61,7 @@ import numpy as np
 
 for _ in range(6):
     fwdbwd_future = training_client.forward_backward(processed_examples, "cross_entropy")
-    optim_future = training_client.optim_step(types.AdamParams(learning_rate=
+    optim_future = training_client.optim_step(types.AdamParams(learning_rate=2e-4))  # Use get_lr() for production
 
     fwdbwd_result = fwdbwd_future.result()
     optim_result = optim_future.result()
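Rather than hardcoding `2e-4`, the model-dependent default can be pulled from `get_lr`, as the other references in this package do. A minimal sketch that reuses `training_client` and `types` from the loop above; verify the helper against your installed `tinker_cookbook` version:

```python
from tinker_cookbook.hyperparam_utils import get_lr

# Model-dependent default (~2.8e-4 for Llama-3.1-8B per the hyperparameter table).
lr = get_lr("meta-llama/Llama-3.1-8B")
optim_future = training_client.optim_step(types.AdamParams(learning_rate=lr))
```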
package/bin/skills/tinker/references/models-and-lora.md

@@ -3,7 +3,7 @@
 ## Model Selection Guide
 
 - **Use MoE models** - More cost effective than dense
-- **Base models** -
+- **Base models** - For LoRA-based post-training research (no built-in chat format)
 - **Instruction models** - Fast inference, no chain-of-thought
 - **Hybrid/Reasoning models** - Long chain-of-thought for quality
 
@@ -36,7 +36,7 @@
 **Sizes:** Compact (1-4B), Small (8B), Medium (30-32B), Large (70B+)
 
 **Types:**
-- **Base**: Pretrained, for post-training research
+- **Base**: Pretrained, no chat formatting — use for LoRA post-training research
 - **Instruction**: Chat-tuned, fast inference
 - **Hybrid**: Thinking + non-thinking modes
 - **Reasoning**: Always uses chain-of-thought
@@ -51,6 +51,12 @@ LoRA (Low-Rank Adaptation) fine-tunes small parameter subset instead of all weig
 - SL on small-medium instruction datasets: **Same as full fine-tuning**
 - RL: **Equivalent to full fine-tuning even with small ranks**
 - Large datasets: May underperform (increase rank)
+- LoRA performs better when applied to **all weight matrices** (attention + MLP + MoE). Attention-only LoRA underperforms even with matched parameter counts
+
+### LoRA Limitations
+
+- **Large batch sizes**: LoRA is less tolerant of large batch sizes than full FT — pays a larger loss penalty as batch size increases beyond some point. This penalty is NOT mitigated by increasing rank; it's a property of the product-of-matrices parametrization
+- **Large SL datasets**: When dataset exceeds LoRA capacity, results in worse training efficiency (not a distinct floor)
 
 ### LoRA Learning Rate
 
@@ -61,7 +67,10 @@ from tinker_cookbook.hyperparam_utils import get_lora_lr_over_full_finetune_lr
 
 model_name = "meta-llama/Llama-3.1-8B"
 factor = get_lora_lr_over_full_finetune_lr(model_name)
-#
+# Factor varies by model size:
+#   Llama-3.2-1B  → 32
+#   Llama-3.1-8B  → ~50
+#   Llama-3.1-70B → 128
 ```
 
 ### Recommended Learning Rate
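A sketch of how the multiplier above might be used, under the assumption that the returned factor scales a known full fine-tune learning rate up to a LoRA learning rate; the full fine-tune value here is a made-up placeholder:

```python
from tinker_cookbook.hyperparam_utils import get_lora_lr_over_full_finetune_lr

model_name = "meta-llama/Llama-3.1-8B"
full_finetune_lr = 5e-6                                    # hypothetical, for illustration only
factor = get_lora_lr_over_full_finetune_lr(model_name)     # ~50 for this model per the comment above
lora_lr = full_finetune_lr * factor                        # ~2.5e-4
```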
package/bin/skills/tinker/references/recipes.md

@@ -12,6 +12,7 @@ from tinker_cookbook.renderers import TrainOnWhat
 from tinker_cookbook.supervised import train
 from tinker_cookbook.supervised.data import FromConversationFileBuilder
 from tinker_cookbook.supervised.types import ChatDatasetBuilderCommonConfig
+from tinker_cookbook.hyperparam_utils import get_lr
 
 def build_config_blueprint() -> chz.Blueprint[train.Config]:
     model_name = "meta-llama/Llama-3.1-8B"
@@ -35,7 +36,7 @@ def build_config_blueprint() -> chz.Blueprint[train.Config]:
         "log_path": "/tmp/tinker-examples/sl_basic",
         "model_name": model_name,
         "dataset_builder": dataset,
-        "learning_rate":
+        "learning_rate": get_lr(model_name),
         "lr_schedule": "linear",
         "num_epochs": 1,
         "eval_every": 8,
@@ -61,13 +62,14 @@ from tinker_cookbook import checkpoint_utils, model_info, renderers
 from tinker_cookbook.supervised.common import compute_mean_nll
 from tinker_cookbook.supervised.data import conversation_to_datum
 from tinker_cookbook.tokenizer_utils import get_tokenizer
+from tinker_cookbook.hyperparam_utils import get_lr
 
 @chz.chz
 class Config:
     log_path: str = "/tmp/tinker-examples/sl-loop"
     model_name: str = "meta-llama/Llama-3.1-8B"
     batch_size: int = 128
-    learning_rate: float =
+    learning_rate: float = get_lr("meta-llama/Llama-3.1-8B")  # ~2.8e-4
     max_length: int = 32768
     train_on_what: renderers.TrainOnWhat = renderers.TrainOnWhat.ALL_ASSISTANT_MESSAGES
     lora_rank: int = 32
@@ -257,6 +259,41 @@ if __name__ == "__main__":
     chz.nested_entrypoint(main)
 ```
 
+## DPO / Preference Learning
+
+```bash
+python -m tinker_cookbook.recipes.preference.train \
+    log_path=/tmp/dpo-experiment \
+    model_name=meta-llama/Llama-3.2-1B \
+    dataset=hhh \
+    renderer_name=role_colon \
+    learning_rate=1e-5 \
+    dpo_beta=0.1
+```
+
+Available datasets: `hhh` (Anthropic), `helpsteer3` (NVIDIA), `ultrafeedback`
+
+See [DPO & Preference Learning](dpo-and-preference.md) for full RLHF pipeline and parameters.
+
+## Prompt Distillation
+
+```bash
+# Step 1: Generate training data with teacher model
+python -m tinker_cookbook.recipes.prompt_distillation.create_data \
+    output_file=/tmp/tinker-datasets/prompt_distillation_lang.jsonl
+
+# Step 2: Train student model on distilled data
+python -m tinker_cookbook.recipes.prompt_distillation.train
+```
+
+## Multi-Step RL (Twenty Questions)
+
+```bash
+python -m tinker_cookbook.recipes.twenty_questions.train
+```
+
+Complete multi-step environment where a question-asking agent learns to guess hidden words. Good reference for building custom multi-turn RL environments.
+
 ## Running Recipes
 
 ```bash
@@ -271,6 +308,15 @@ python -m tinker_cookbook.recipes.rl_basic
 
 # Manual RL loop
 python -m tinker_cookbook.recipes.rl_loop
+
+# DPO
+python -m tinker_cookbook.recipes.preference.train
+
+# Prompt distillation
+python -m tinker_cookbook.recipes.prompt_distillation.train
+
+# Multi-step RL
+python -m tinker_cookbook.recipes.twenty_questions.train
 ```
 
 ## CLI Overrides
package/bin/skills/tinker/references/reinforcement-learning.md

@@ -11,6 +11,8 @@ Fine-tunes Llama-3.1-8B on GSM8K with reward:
 1[answer correct] + 0.1 * (1[format correct] - 1)
 ```
 
+Training takes ~1 min/iteration, reaches ~63% accuracy after 15 iterations.
+
 ## Basic RL Config
 
 ```python
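The reward formula quoted above translates directly into a tiny scoring function. A minimal sketch; how `answer_correct` and `format_correct` are determined (answer extraction, format check) is left to your environment:

```python
def gsm8k_style_reward(answer_correct: bool, format_correct: bool) -> float:
    """Literal transcription of: 1[answer correct] + 0.1 * (1[format correct] - 1)."""
    return float(answer_correct) + 0.1 * (float(format_correct) - 1.0)

# Correct + well formatted: 1.0; correct but badly formatted: 0.9;
# wrong but well formatted: 0.0; wrong and badly formatted: -0.1.
assert gsm8k_style_reward(True, True) == 1.0
assert abs(gsm8k_style_reward(False, False) - (-0.1)) < 1e-9
```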
@@ -54,6 +56,93 @@ if __name__ == "__main__":
 - `entropy` - Per-token entropy
 - `kl_sample_train_v1/v2` - KL divergence (sampler vs learner)
 
+## RL Environment Classes
+
+Custom RL environments use three classes from `tinker_cookbook.rl.types`:
+
+### Env
+
+Stateful environment for a single agent episode. Discard after one episode.
+
+```python
+from tinker_cookbook.rl.types import Env, Observation, StopCondition, Action, StepResult
+
+class MyEnv(Env):
+    async def initial_observation(self) -> tuple[Observation, StopCondition]:
+        # Return initial tokens and stop condition
+        raise NotImplementedError
+
+    async def step(self, action: Action) -> StepResult:
+        # Process agent action, return next observation + reward
+        raise NotImplementedError
+```
+
+Note: `Env` operates on tokens (not strings/messages) because the training code needs exact tokens and logprobs.
+
+### EnvGroupBuilder
+
+Creates groups of environments (enables multi-agent training or paired comparisons):
+
+```python
+from tinker_cookbook.rl.types import EnvGroupBuilder
+
+class MyEnvGroupBuilder(EnvGroupBuilder):
+    async def make_envs(self) -> list[Env]:
+        return [MyEnv() for _ in range(group_size)]
+```
+
+### RLDataset
+
+Dataset of `EnvGroupBuilder` objects:
+
+```python
+from tinker_cookbook.rl.types import RLDataset
+
+class MyDataset(RLDataset):
+    def get_batch(self, index: int) -> list[EnvGroupBuilder]:
+        return [MyEnvGroupBuilder(problem) for problem in self.problems[index]]
+```
+
+### Multi-Step Environment Example
+
+See `tinker_cookbook.recipes.twenty_questions` for a complete multi-step environment where a question-asking agent learns to guess hidden words:
+
+```bash
+python -m tinker_cookbook.recipes.twenty_questions.train
+```
+
+## Completers (Policy Abstractions)
+
+Completers represent models/policies that can be sampled from.
+
+### TokenCompleter (for RL training)
+
+```python
+from tinker_cookbook.rl.types import TokenCompleter, TokensWithLogprobs
+
+class TokenCompleter:
+    async def __call__(
+        self, model_input: types.ModelInput, stop: StopCondition
+    ) -> TokensWithLogprobs:
+        ...
+```
+
+Used by RL algorithms because they work directly with tokens.
+
+### MessageCompleter (for inference/judging)
+
+```python
+from tinker_cookbook.rl.types import MessageCompleter
+
+class MessageCompleter:
+    async def __call__(self, messages: list[renderers.Message]) -> renderers.Message:
+        ...
+```
+
+Operates at message level — useful for judge models, multi-agent environments, and evaluation. Requires a renderer to convert messages to tokens for the sampling client.
+
+Concrete implementations: `TinkerTokenCompleter` and `TinkerMessageCompleter` (wrappers around `tinker.SamplingClient`).
+
 ## Custom RL Loop
 
 ```python
@@ -143,12 +232,18 @@ def main(config: Config):
 
 ## Hyperparameters
 
+### Learning Rate
+
+Same guidance as SL — use `get_lr(model_name)` as starting point. The rl_basic recipe uses `4e-5` for Llama-3.1-8B.
+
 ### Batch and Group Sizes
 
-- `batch_size`: Number of unique problems
+- `batch_size`: Number of unique problems per iteration
 - `group_size`: Rollouts per problem (for variance reduction)
 
-Scale: `LR ∝ √batch_size`
+Scale LR proportionally: `LR ∝ √batch_size`
+
+If you have limited problems, increase `group_size` to generate more training data.
 
 ### Multiple Updates (num_substeps)
 
|
|
|
157
252
|
num_substeps = 1
|
|
158
253
|
|
|
159
254
|
# Multiple updates: split batch into mini-batches
|
|
160
|
-
num_substeps = 4 # Batch must be divisible
|
|
255
|
+
num_substeps = 4 # Batch must be divisible by num_substeps
|
|
161
256
|
```
|
|
162
257
|
|
|
163
|
-
Use with PPO objective. Start with 2-4.
|
|
258
|
+
Use with PPO objective. Start with 2-4. Higher values risk out-of-distribution updates.
|
|
164
259
|
|
|
165
260
|
### Streaming Minibatch Training
|
|
166
261
|
|
|
167
|
-
Overlaps sampling and training for throughput:
|
|
262
|
+
Overlaps sampling and training for throughput (on-policy, pipeline efficiency only):
|
|
168
263
|
|
|
169
264
|
```python
|
|
170
265
|
StreamMinibatchConfig(
|
|
@@ -175,22 +270,72 @@ StreamMinibatchConfig(
|
|
|
175
270
|
|
|
176
271
|
### Async Off-Policy Training
|
|
177
272
|
|
|
178
|
-
For long rollouts:
|
|
273
|
+
For long rollouts (CoT, tool use, agentic workflows):
|
|
179
274
|
|
|
180
275
|
```python
|
|
181
276
|
AsyncConfig(
|
|
182
|
-
max_steps_off_policy=3, # Max age of trajectories
|
|
277
|
+
max_steps_off_policy=3, # Max age of trajectories before discard
|
|
183
278
|
groups_per_batch=64,
|
|
184
279
|
)
|
|
185
280
|
```
|
|
186
281
|
|
|
282
|
+
Start with `max_steps_off_policy < 5`. Monitor KL divergence carefully.
|
|
283
|
+
|
|
284
|
+
## Sequence Extension Property
|
|
285
|
+
|
|
286
|
+
Critical for multi-turn RL efficiency. When consecutive timesteps satisfy the extension property (each observation is a prefix extension of the previous), compute scales O(T) instead of O(T²).
|
|
287
|
+
|
|
288
|
+
### When Extension Holds
|
|
289
|
+
|
|
290
|
+
With `Qwen3Renderer(strip_thinking_from_history=False)`, thinking blocks are preserved:
|
|
291
|
+
|
|
292
|
+
```
|
|
293
|
+
Timestep 1: [User: Q1] [A: <think>...</think> A1] [User:]
|
|
294
|
+
Timestep 2: [User: Q1] [A: <think>...</think> A1] [User: Q2] [A: <think>...</think> A2] [User:]
|
|
295
|
+
```
|
|
296
|
+
|
|
297
|
+
Timestep 2 contains timestep 1 as a prefix → single Datum, KV-cache reuse.
|
|
298
|
+
|
|
299
|
+
### When Extension Breaks
|
|
300
|
+
|
|
301
|
+
With default `strip_thinking_from_history=True`, `<think>` blocks are stripped from history:
|
|
302
|
+
|
|
303
|
+
```
|
|
304
|
+
Timestep 1: [User: Q1] [A: <think>...</think> A1] [User:]
|
|
305
|
+
Timestep 2: [User: Q1] [A: A1] [User: Q2] [A: <think>...</think> A2] [User:]
|
|
306
|
+
```
|
|
307
|
+
|
|
308
|
+
Prefix doesn't match → separate Datums per timestep, O(T²) compute.
|
|
309
|
+
|
|
310
|
+
### Check Programmatically
|
|
311
|
+
|
|
312
|
+
```python
|
|
313
|
+
renderer.has_extension_property # True or False
|
|
314
|
+
```
|
|
315
|
+
|
|
316
|
+
For `Qwen3Renderer`:
|
|
317
|
+
- `strip_thinking_from_history=False` → `has_extension_property=True`
|
|
318
|
+
- `strip_thinking_from_history=True` (default) → `has_extension_property=False`
|
|
319
|
+
|
|
320
|
+
### Periodic Compaction (Hybrid)
|
|
321
|
+
|
|
322
|
+
Keep thinking visible most of the time, periodically strip old thinking blocks:
|
|
323
|
+
|
|
324
|
+
- Turns 1-N: keep thinking visible (extension holds, single datum)
|
|
325
|
+
- Turn N+1: strip thinking from turns 1-N (extension breaks once)
|
|
326
|
+
- Turns N+1 to 2N: keep thinking visible again
|
|
327
|
+
- Repeat every N turns
|
|
328
|
+
|
|
329
|
+
This amortizes recomputation cost over N turns with bounded context growth.
|
|
330
|
+
|
|
187
331
|
## Monitoring
|
|
188
332
|
|
|
189
333
|
### KL Divergence
|
|
190
334
|
|
|
191
335
|
Monitor `kl_sample_train_v1/v2`:
|
|
192
336
|
- Should stay below 0.01 for stable training
|
|
193
|
-
-
|
|
337
|
+
- Even on-policy training shows non-zero KL (implementation detail)
|
|
338
|
+
- KL crossing threshold indicates numerical instability
|
|
194
339
|
|
|
195
340
|
### Reward Curves
|
|
196
341
|
|
|
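For intuition about the `kl_sample_train` metrics above, here is a back-of-the-envelope estimator in the spirit of a v1-style KL check: average the per-token gap between sampler and learner logprobs over a rollout. The arrays are hypothetical; the real metrics are computed by the training service, so treat this only as a mental model:

```python
import numpy as np

sampler_logprobs = np.array([-1.200, -0.450, -2.100, -0.330])  # logprobs at sampling time
learner_logprobs = np.array([-1.204, -0.452, -2.106, -0.334])  # logprobs under current weights
kl_estimate = float(np.mean(sampler_logprobs - learner_logprobs))  # ~0.004, under the 0.01 guideline
```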
package/bin/skills/tinker/references/rendering.md

@@ -49,10 +49,22 @@ model_input, weights = renderer.build_supervised_example(
     messages,
     train_on_what=TrainOnWhat.ALL_ASSISTANT_MESSAGES,
 )
-# model_input: ModelInput (
+# model_input: ModelInput (token sequence)
 # weights: per-token weights (0.0 = prompt, 1.0 = train)
 ```
 
+**Default behavior:** If `train_on_what` is omitted, defaults to `LAST_ASSISTANT_MESSAGE`. Always specify explicitly to avoid surprises.
+
+```python
+# WRONG — silently trains only on last assistant message
+model_input, weights = renderer.build_supervised_example(messages)
+
+# RIGHT — explicitly train on all assistant messages
+model_input, weights = renderer.build_supervised_example(
+    messages, train_on_what=TrainOnWhat.ALL_ASSISTANT_MESSAGES
+)
+```
+
 ### build_generation_prompt
 
 For inference:
package/bin/skills/tinker/references/supervised-learning.md

@@ -17,6 +17,7 @@ from tinker_cookbook.supervised.types import ChatDatasetBuilderCommonConfig
 from tinker_cookbook.supervised.data import FromConversationFileBuilder
 from tinker_cookbook.renderers import TrainOnWhat
 from tinker_cookbook.model_info import get_recommended_renderer_name
+from tinker_cookbook.hyperparam_utils import get_lr
 
 def build_config_blueprint() -> chz.Blueprint[train.Config]:
     model_name = "meta-llama/Llama-3.1-8B"
@@ -39,8 +40,8 @@ def build_config_blueprint() -> chz.Blueprint[train.Config]:
         "log_path": "/tmp/training",
         "model_name": model_name,
         "dataset_builder": dataset_builder,
-        "learning_rate":
-        "lr_schedule": "
+        "learning_rate": get_lr(model_name),
+        "lr_schedule": "linear",
         "num_epochs": 3,
         "lora_rank": 32,
 })
@@ -182,19 +183,23 @@ class CustomDataset(SupervisedDataset):
     def __iter__(self):
         for item in self.data:
             messages = self._preprocess(item)
-
+            # build_supervised_example returns (model_input, weights) tuple
+            model_input, weights = self.renderer.build_supervised_example(
                 messages=messages,
                 train_on_what=TrainOnWhat.ALL_ASSISTANT_MESSAGES,
             )
+            tokens = model_input.to_ints()
             yield Datum(
-                model_input=ModelInput([
+                model_input=ModelInput.from_ints(tokens=tokens[:-1]),
                 loss_fn_inputs={
-                    "target_tokens": TensorData.from_numpy(np.array(
-                    "weights": TensorData.from_numpy(np.array(
+                    "target_tokens": TensorData.from_numpy(np.array(tokens[1:], dtype=np.int64)),
+                    "weights": TensorData.from_numpy(np.array(weights[1:], dtype=np.float32)),
                 },
             )
 ```
 
+**Note:** `build_supervised_example` returns a tuple `(model_input, weights)`. The `model_input` is a `ModelInput` (token sequence), and `weights` is a list of per-token floats (0.0 for prompt tokens, 1.0 for completion tokens). When constructing `Datum` manually, shift by one position: input is `tokens[:-1]`, targets are `tokens[1:]`, weights are `weights[1:]`.
+
 ## Hyperparameters
 
 ### Learning Rate
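A toy illustration of the shift-by-one convention described in the note above, using made-up token ids and weights rather than real renderer output:

```python
tokens  = [101, 7, 8, 9, 102]        # rendered conversation, 5 tokens
weights = [0.0, 0.0, 1.0, 1.0, 1.0]  # 0.0 = prompt, 1.0 = completion

model_tokens   = tokens[:-1]   # [101, 7, 8, 9]       -> Datum model_input
target_tokens  = tokens[1:]    # [7, 8, 9, 102]       -> loss_fn_inputs["target_tokens"]
target_weights = weights[1:]   # [0.0, 1.0, 1.0, 1.0] -> loss_fn_inputs["weights"]
assert len(model_tokens) == len(target_tokens) == len(target_weights)
```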
@@ -215,7 +220,26 @@ Formula: `LR = lr_base * M_LoRA * (2000/H_m)^P_m`
 
 - Smaller batch sizes (128) generally better for fine-tuning
 - Scale LR with `LR ∝ √batch_size`
-- Aim for at least 100 steps of training
+- Aim for at least 100 steps of training (1000+ steps for best results)
+
+## Checkpoints
+
+Two types of saved weights:
+
+| Method | Path Contains | Use Case |
+|--------|--------------|----------|
+| `save_weights_for_sampler()` | `/sampler_weights/` | Sampling/inference only (lightweight) |
+| `save_state()` | `/weights/` | Full optimizer state for resuming training |
+
+```python
+# Save for sampling
+path = training_client.save_weights_for_sampler(name="final").result().path
+sampling_client = service_client.create_sampling_client(model_path=path)
+
+# Save full state for resuming
+state_path = training_client.save_state(name="checkpoint").result().path
+training_client.load_state(state_path)  # Resume later
+```
 
 ## Output Files
 
package/bin/synsc
CHANGED
Binary file