@synsci/cli-darwin-x64-baseline 1.1.72 → 1.1.74
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/bin/skills/llm-as-judge-evaluation/SKILL.md +385 -0
- package/bin/skills/llm-as-judge-evaluation/references/pairwise-comparison.md +95 -0
- package/bin/skills/llm-as-judge-evaluation/references/scoring-rubrics.md +169 -0
- package/bin/skills/model-economics/SKILL.md +238 -0
- package/bin/skills/training-data-pipeline/SKILL.md +427 -0
- package/bin/skills/training-data-pipeline/references/data-quality.md +136 -0
- package/bin/skills/training-data-pipeline/references/frontier-distillation.md +129 -0
- package/bin/skills/training-data-pipeline/references/production-data-formatting.md +126 -0
- package/bin/synsc +0 -0
- package/package.json +1 -1
package/bin/skills/model-economics/SKILL.md
@@ -0,0 +1,238 @@
---
name: model-economics
description: Cost modeling and ROI analysis for specialized LLM development. Use when deciding whether to train a custom model, estimating total cost, or calculating break-even vs frontier APIs. Covers training costs, inference costs, and time-to-ROI projections.
version: 1.0.0
author: Synthetic Sciences
license: MIT
tags: [Economics, Cost Analysis, ROI, Break-Even, Training Cost, Inference Cost, Model Flywheel, Business Case]
dependencies: []
---

# Model Economics: Cost Analysis for LLM Specialization

## When to Use This Skill

Use this skill as the **first step** of any flywheel assessment to determine:
- Is it worth training a specialized model?
- What will the total investment be?
- When will the specialized model break even vs frontier APIs?
- What's the ongoing cost comparison?

## Frontier API Cost Model

**Important**: Use `websearch` to get current pricing — these change frequently. Reference tiers as of early 2026:

### Premium Tier ($25-168/M output tokens)
| Model | Input ($/M) | Output ($/M) | Notes |
|-------|-------------|--------------|-------|
| Claude Opus 4.6 | $5.00 | $25.00 | Highest capability |
| GPT-5.2 Pro | $21.00 | $168.00 | Reasoning-heavy |

**Savings potential**: Highest. A specialized 8B model costs ~$0.05-0.15/M tokens to serve = 100-500x cheaper.

### Standard Tier ($6-15/M output tokens)
| Model | Input ($/M) | Output ($/M) | Notes |
|-------|-------------|--------------|-------|
| GPT-5.2 Chat | $1.75 | $14.00 | Most popular |
| Claude Sonnet 4.5 | $3.00 | $15.00 | Good balance |
| Qwen3 Max | $1.20 | $6.00 | Cost-effective |

**Savings potential**: Strong at volume. Break-even is achievable with moderate traffic.

### Budget Tier ($0.30-3/M output tokens)
| Model | Input ($/M) | Output ($/M) | Notes |
|-------|-------------|--------------|-------|
| Gemini 3 Flash | $0.50 | $3.00 | Fast, cheap |
| GLM 4.7 | $0.40 | $1.50 | Very cost-effective |
| Mistral Small | $0.10 | $0.30 | Cheapest commercial |

**Savings potential**: Harder to beat on cost alone. Specialize for quality, not cost.

### Free / Near-Free
Many open-weight models are available on OpenRouter at $0. If you only need basic capability, self-hosting these is essentially free beyond GPU cost.

## Training Cost Estimation

### Tinker (Managed LoRA)
Cheapest option for supported models. No GPU management.

| Model Size | Method | Cost/1K examples | Duration |
|------------|--------|-------------------|----------|
| 8B | LoRA | ~$2-5 | 15-30 min |
| 14B | LoRA | ~$5-10 | 30-60 min |
| 32B | LoRA | ~$10-25 | 1-3 hours |
| 70B | LoRA | ~$25-60 | 3-8 hours |

Load the `tinker-training-cost` skill for exact per-token rates.

### Modal (Serverless GPU)

| GPU | Cost/Hour | Best For |
|-----|-----------|----------|
| A100-40GB | ~$1.10 | 7-14B LoRA, 7B full |
| A100-80GB | ~$1.60 | 14-32B LoRA, 14B full |
| H100-80GB | ~$3.25 | 32-70B LoRA, large batches |

**Typical training runs** (GPU hours × hourly rate; see the sketch below):
- 8B LoRA, 5K examples, 3 epochs: ~1-2 hours on A100 = $1-3
- 14B LoRA, 10K examples, 3 epochs: ~3-5 hours on A100-80GB = $5-8
- 70B LoRA, 10K examples, 2 epochs: ~8-12 hours on H100 = $26-39
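
These estimates are simply GPU hours multiplied by the hourly rate (storage and egress excluded); a minimal sketch reproducing the 14B line above:

```python
def training_cost(gpu_hours: float, rate_per_hour: float) -> float:
    """Estimated training cost = GPU hours x hourly rate."""
    return gpu_hours * rate_per_hour

# 14B LoRA, ~3-5 hours on an A100-80GB at ~$1.60/hr:
print(training_cost(3, 1.60), training_cost(5, 1.60))  # ~$4.80 to $8.00
```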

### TensorPool (Dedicated GPU)

| GPU | Cost/Hour | Best For |
|-----|-----------|----------|
| A100-80GB | ~$1.50 | Standard training |
| H200-141GB | ~$3.00 | Large models, fast training |
| B200-192GB | ~$4.50 | Cutting-edge, maximum throughput |

Best for: Multi-day training, large-scale RL, dedicated inference.

## Inference Cost Comparison

The key calculation: what does it cost to serve the specialized model vs the frontier API?

### Self-Hosted Inference Cost

```python
def inference_cost_per_million_tokens(gpu_cost_per_hour, tokens_per_second):
    """Calculate cost per million tokens for self-hosted inference.

    Args:
        gpu_cost_per_hour: GPU rental cost (e.g., $3.25 for H100)
        tokens_per_second: Model throughput (varies by model size and batch)
    """
    tokens_per_hour = tokens_per_second * 3600
    cost_per_token = gpu_cost_per_hour / tokens_per_hour
    cost_per_million = cost_per_token * 1_000_000
    return cost_per_million
```
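
For example, an 8B model on an A100-80GB (~$1.60/hr at ~5,000 tok/s batched, per the table below):

```python
print(inference_cost_per_million_tokens(1.60, 5000))  # ~$0.089/M tokens
```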

### Reference Throughput (vLLM, single GPU)

| Model Size | GPU | Tokens/sec (batch) | Cost/M tokens |
|------------|-----|-------------------|---------------|
| 8B | A100-80GB | ~3,000-5,000 | $0.09-0.15 |
| 8B | H100-80GB | ~5,000-8,000 | $0.11-0.18 |
| 14B | A100-80GB | ~1,500-3,000 | $0.15-0.30 |
| 14B | H100-80GB | ~3,000-5,000 | $0.18-0.30 |
| 32B | H100-80GB | ~800-1,500 | $0.60-1.10 |
| 70B | 2xH100 | ~500-1,000 | $1.80-3.60 |

**Key insight**: A specialized 8B model on Modal/TensorPool costs **$0.05-0.15/M tokens** vs **$5-25/M** for premium frontier = **50-500x cheaper per request**.

## Break-Even Calculator

```python
def break_even_analysis(
    monthly_api_spend: float,
    training_cost: float,
    monthly_inference_cost: float,
    setup_hours: float = 40,
    hourly_rate: float = 100,
):
    """Calculate break-even timeline for model specialization.

    Args:
        monthly_api_spend: Current monthly frontier API cost
        training_cost: One-time training cost (GPU + data prep)
        monthly_inference_cost: Monthly cost to serve specialized model
        setup_hours: Engineering time to set up pipeline
        hourly_rate: Engineering cost per hour
    """
    engineering_cost = setup_hours * hourly_rate
    total_upfront = training_cost + engineering_cost
    monthly_savings = monthly_api_spend - monthly_inference_cost

    if monthly_savings <= 0:
        return {
            "viable": False,
            "reason": "Specialized model costs more to serve than frontier API",
            "monthly_savings": monthly_savings,
        }

    months_to_break_even = total_upfront / monthly_savings

    return {
        "viable": True,
        "total_upfront_cost": total_upfront,
        "training_cost": training_cost,
        "engineering_cost": engineering_cost,
        "monthly_api_spend": monthly_api_spend,
        "monthly_inference_cost": monthly_inference_cost,
        "monthly_savings": monthly_savings,
        "months_to_break_even": round(months_to_break_even, 1),
        "annual_savings": monthly_savings * 12 - total_upfront,  # year-1 net of upfront
        "year_2_savings": monthly_savings * 12,  # Fully amortized
    }


# Example: Replace GPT-4o for a specific task
result = break_even_analysis(
    monthly_api_spend=5000,       # $5K/month on GPT-4o
    training_cost=500,            # $500 for LoRA fine-tune + data prep
    monthly_inference_cost=200,   # $200/month on Modal (8B model)
    setup_hours=40,               # 1 week of engineering
    hourly_rate=100,              # $100/hr engineer
)
# Result: break-even in ~0.9 months, ~$53K net savings in year 1
```

## ROI Timeline Template

| Month | Frontier Cost | Specialized Cost | Cumulative Savings | Notes |
|-------|--------------|-----------------|-------------------|-------|
| 0 | — | $4,500 upfront | -$4,500 | Training + engineering |
| 1 | $5,000 | $200 | $300 | First month live; break-even |
| 2 | $5,000 | $200 | $5,100 | |
| 3 | $5,000 | $200 | $9,900 | |
| 6 | $5,000 | $200 | $24,300 | |
| 12 | $5,000 | $300 | $53,000 | Includes retrain cost |
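
The table can be regenerated for any scenario; a minimal sketch using the example inputs above (the $100 month-12 retrain is the assumption behind that row):

```python
def roi_timeline(months, monthly_api_spend=5000, monthly_inference=200,
                 upfront=4500, extra_costs=None):
    """Print cumulative savings per month, mirroring the table above.

    extra_costs maps month -> one-off cost (e.g., a periodic retrain).
    """
    extra_costs = extra_costs or {}
    cumulative = -upfront
    for month in range(1, months + 1):
        specialized = monthly_inference + extra_costs.get(month, 0)
        cumulative += monthly_api_spend - specialized
        print(f"Month {month:2d}: cumulative savings ${cumulative:,.0f}")

roi_timeline(12, extra_costs={12: 100})
# Month  1: cumulative savings $300 ... Month 12: cumulative savings $53,000
```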

## Decision Framework

### Strong Case for Specialization
- Monthly API spend > $1,000
- Constrained task (classification, extraction, formatting, domain Q&A)
- Production data available (logs, corrections, accept/reject signals)
- Quality requirements are well-defined and measurable
- Task doesn't change frequently

### Weak Case for Specialization
- Monthly API spend < $500
- Diverse, open-ended tasks (general assistant)
- No production data to train on
- Rapid model improvement expected (new frontier models releasing soon)
- Task requires reasoning at the frontier level
- Small team with no ML engineering capacity

### Hybrid Approach
Route easy requests to the specialized model and hard ones to the frontier (a blended-cost sketch follows the list):
- **90% of requests** → Specialized 8B (cheap, fast)
- **10% of requests** → Frontier fallback (complex, novel)
- **Result**: 80-90% cost reduction while maintaining quality on hard cases
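
A minimal blended-cost sketch (the rates are the reference numbers above; the 90/10 split is an assumption to tune per workload):

```python
def blended_cost_per_m(frontier_rate, specialized_rate, frontier_share=0.10):
    """Blended $/M output tokens when a share of traffic falls back to the frontier."""
    return frontier_share * frontier_rate + (1 - frontier_share) * specialized_rate

# $15/M frontier, $0.15/M specialized 8B, 10% fallback:
print(blended_cost_per_m(15.0, 0.15))  # ~$1.64/M, ~89% below frontier-only
```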

## Quick Assessment Questions

Ask the user these to determine viability:

1. **What's your monthly frontier API spend?**
   - < $500: Probably not worth it (low ROI)
   - $500-2K: Worth investigating
   - $2K+: Strong candidate
   - $10K+: Almost certainly worth it

2. **How constrained is the task?**
   - Single task (classification, extraction): Great fit
   - Few related tasks: Good fit
   - Many diverse tasks: Harder to specialize

3. **What production data do you have?**
   - API logs with user feedback: Excellent
   - API logs without feedback: Good (can distill)
   - No production data: Need synthetic bootstrap

4. **What's your quality bar?**
   - Must match frontier exactly: Harder, may need bigger model
   - 90% of frontier quality is fine: 8B LoRA likely sufficient
   - Task-specific metrics (accuracy, format): Easiest to optimize

package/bin/skills/training-data-pipeline/SKILL.md
@@ -0,0 +1,427 @@
---
name: training-data-pipeline
description: Build training datasets for LLM specialization from production data, frontier model distillation, and synthetic bootstrapping. Use when formatting production logs into SFT data, distilling from frontier APIs, or preparing data for fine-tuning. Covers JSONL formatting, data quality validation, deduplication, and train/eval splitting.
version: 1.0.0
author: Synthetic Sciences
license: MIT
tags: [Training Data, Data Pipeline, Fine-Tuning, Distillation, Synthetic Data, Production Data, JSONL, Data Quality]
dependencies: [datasets, transformers, openai]
---

# Training Data Pipeline

## When to Use This Skill

Use this skill when you need to:
- **Format production logs** into SFT training data (API logs, user corrections, accept/reject signals)
- **Distill from frontier models** using batch APIs (OpenAI, Anthropic) to label production inputs
- **Bootstrap synthetic data** when fewer than 1000 real examples exist
- **Validate data quality** before training (dedup, schema check, diversity metrics)
- **Split data** into train/eval sets with production data reserved for evaluation

### Three Data Paths

| Path | When to Use | Data Source | Cost |
|------|-------------|-------------|------|
| A) Production data | Have API logs or user feedback | Your own production systems | Free (already collected) |
| B) Frontier distillation | Have production inputs but no labels | OpenAI/Anthropic batch APIs | ~50% of real-time API cost |
| C) Synthetic bootstrap | < 1000 real examples | Frontier model generation | Varies by volume |

**Always prefer Path A** — production data is the moat competitors can't replicate.

## JSONL Chat Format

All training platforms (Tinker, Unsloth, TRL, Axolotl) accept this standard chat format:

```jsonl
{"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "What is 2+2?"}, {"role": "assistant", "content": "4"}]}
{"messages": [{"role": "user", "content": "Translate to French: Hello"}, {"role": "assistant", "content": "Bonjour"}]}
```

### Format Rules
- One JSON object per line, no trailing commas
- `messages` array with `role` and `content` fields
- Roles: `system` (optional, first only), `user`, `assistant` (alternating)
- Multi-turn: alternate user/assistant pairs within a single messages array
- UTF-8 encoding, no BOM
- `assistant` messages are the training targets — everything else is context

### Platform-Specific Notes

**Tinker**: Standard chat format above. Max 32K tokens per example. System message optional.

**Unsloth/TRL**: Same format. Also accepts `{"prompt": "...", "completion": "..."}` for simple pairs. Chat format preferred for multi-turn.

**Axolotl**: Supports multiple formats via config. Recommend the `chat_template` type with standard JSONL.

## Path A: Production Data Collection

### From API Logs

If you log API requests/responses, convert them directly:

```python
import json

def api_log_to_training(log_entry):
    """Convert an API request/response log to training format."""
    messages = []

    # Add system prompt if present
    if log_entry.get("system_prompt"):
        messages.append({
            "role": "system",
            "content": log_entry["system_prompt"]
        })

    # Add the user's input
    messages.append({
        "role": "user",
        "content": log_entry["user_input"]
    })

    # Add the response (use corrected version if available)
    response = log_entry.get("corrected_response") or log_entry["api_response"]
    messages.append({
        "role": "assistant",
        "content": response
    })

    return {"messages": messages}

# Process logs
with open("api_logs.jsonl") as f, open("training_data.jsonl", "w") as out:
    for line in f:
        log = json.loads(line)
        example = api_log_to_training(log)
        out.write(json.dumps(example) + "\n")
```

### From User Corrections

User corrections (edits to model output) are the highest-quality training signal:

```python
def correction_to_training(original_input, corrected_output, system_prompt=None):
    """Convert a user correction into a training example.

    The corrected output becomes the training target."""
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": original_input})
    messages.append({"role": "assistant", "content": corrected_output})
    return {"messages": messages}
```

### From Accept/Reject Signals

If users accept or reject model outputs, use accepted outputs as positive examples:

```python
def filter_accepted(logs):
    """Keep only examples where the user accepted the output."""
    accepted = []
    for log in logs:
        if log.get("user_action") == "accepted":
            accepted.append({
                "messages": [
                    {"role": "user", "content": log["input"]},
                    {"role": "assistant", "content": log["output"]}
                ]
            })
    return accepted
```

## Path B: Frontier Distillation

Use frontier models to label your production inputs. Best when you have real inputs but no gold labels.

### OpenAI Batch API (50% discount)

```python
import json

def create_batch_file(inputs, system_prompt, model="gpt-4o"):
    """Create a batch file for the OpenAI Batch API."""
    requests = []
    for i, user_input in enumerate(inputs):
        requests.append({
            "custom_id": f"request-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": model,
                "messages": [
                    {"role": "system", "content": system_prompt},
                    {"role": "user", "content": user_input}
                ],
                "max_tokens": 4096
            }
        })

    with open("batch_input.jsonl", "w") as f:
        for req in requests:
            f.write(json.dumps(req) + "\n")
    return "batch_input.jsonl"

# Submit with the OpenAI Python SDK (client = openai.OpenAI()):
#   batch_file = client.files.create(file=open("batch_input.jsonl", "rb"), purpose="batch")
#   client.batches.create(input_file_id=batch_file.id,
#                         endpoint="/v1/chat/completions", completion_window="24h")
```

### Anthropic Batch API

```python
import anthropic

client = anthropic.Anthropic()

def create_anthropic_batch(inputs, system_prompt, model="claude-sonnet-4-5-20250929"):
    """Create a batch request for the Anthropic Message Batches API."""
    requests = []
    for i, user_input in enumerate(inputs):
        requests.append({
            "custom_id": f"request-{i}",
            "params": {
                "model": model,
                "max_tokens": 4096,
                "system": system_prompt,
                "messages": [
                    {"role": "user", "content": user_input}
                ]
            }
        })

    batch = client.messages.batches.create(requests=requests)
    return batch.id
```

### Processing Batch Results

This parses the OpenAI batch output shape (`response.body.choices`); adapt the extraction for Anthropic's results format:

```python
import json

def batch_results_to_training(results_file, inputs, system_prompt=None):
    """Convert batch API results into training JSONL."""
    training = []
    with open(results_file) as f:
        for line in f:
            result = json.loads(line)
            idx = int(result["custom_id"].split("-")[1])
            messages = []
            if system_prompt:
                messages.append({"role": "system", "content": system_prompt})
            messages.append({"role": "user", "content": inputs[idx]})
            # Extract the assistant response from the batch result
            content = result["response"]["body"]["choices"][0]["message"]["content"]
            messages.append({"role": "assistant", "content": content})
            training.append({"messages": messages})

    with open("distilled_training.jsonl", "w") as f:
        for example in training:
            f.write(json.dumps(example) + "\n")
    return len(training)
```

## Path C: Synthetic Bootstrapping

Generate training data from scratch when you have < 1000 real examples. Use as a starting point, then replace with production data as it accumulates.

### Seed Prompt Strategy

```python
import json

import openai

client = openai.OpenAI()

def generate_synthetic_examples(task_description, seed_examples, n=500, model="gpt-4o"):
    """Generate diverse synthetic training examples from seed examples."""

    meta_prompt = f"""You are generating training data for an LLM that will be fine-tuned for:
{task_description}

Here are {len(seed_examples[:5])} real examples of the desired behavior:
{json.dumps(seed_examples[:5], indent=2)}

Generate a NEW, diverse example. The input should cover a different scenario than
the seeds. The output should match the quality and style of the examples above.

Return JSON: {{"input": "...", "output": "..."}}"""

    examples = []
    for i in range(n):
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": meta_prompt}],
            response_format={"type": "json_object"},
            temperature=0.9,  # High temp for diversity
        )
        example = json.loads(response.choices[0].message.content)
        examples.append({
            "messages": [
                {"role": "user", "content": example["input"]},
                {"role": "assistant", "content": example["output"]}
            ]
        })
    return examples
```

### Diversity Strategies
- Vary temperature (0.7-1.0) across generation batches
- Use different frontier models (GPT-4o, Claude, Gemini) to reduce model-specific bias
- Seed with representative prompts from different categories/difficulty levels
- Include edge cases and adversarial examples explicitly in seed prompts
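
A minimal sketch of rotating the teacher model and temperature across batches (the model list is illustrative, and it assumes `generate_synthetic_examples` above is extended with a `temperature` parameter instead of the hardcoded 0.9):

```python
import itertools

# (model, temperature) pairs to cycle through; swap in other providers' clients as needed
settings = itertools.cycle([
    ("gpt-4o", 0.7),
    ("gpt-4o", 1.0),
    ("gpt-4o-mini", 0.9),
])

examples = []
for _ in range(10):  # 10 batches of 50 = 500 examples
    model, temperature = next(settings)
    examples.extend(generate_synthetic_examples(
        task_description, seed_examples, n=50, model=model,
        # temperature=temperature,  # hypothetical extra parameter, see note above
    ))
```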

## Data Quality Validation

### Schema Validation

```python
import json

def validate_jsonl(filepath):
    """Validate JSONL training file format."""
    errors = []
    valid = 0
    total = 0
    with open(filepath) as f:
        for i, line in enumerate(f, 1):
            total += 1
            try:
                obj = json.loads(line)
            except json.JSONDecodeError as e:
                errors.append(f"Line {i}: Invalid JSON — {e}")
                continue

            if "messages" not in obj:
                errors.append(f"Line {i}: Missing 'messages' key")
                continue

            msgs = obj["messages"]
            if not isinstance(msgs, list) or len(msgs) < 2:
                errors.append(f"Line {i}: 'messages' must be a list with >= 2 entries")
                continue

            # Check roles
            has_user = any(m.get("role") == "user" for m in msgs)
            has_assistant = any(m.get("role") == "assistant" for m in msgs)
            if not has_user or not has_assistant:
                errors.append(f"Line {i}: Must have at least one user and one assistant message")
                continue

            for j, msg in enumerate(msgs):
                if "role" not in msg or "content" not in msg:
                    errors.append(f"Line {i}, message {j}: Missing 'role' or 'content'")
                    break
                if msg["role"] not in ("system", "user", "assistant"):
                    errors.append(f"Line {i}, message {j}: Invalid role '{msg['role']}'")
                    break
            else:
                valid += 1  # only count lines with no per-message errors

    return {"valid": valid, "errors": errors, "total": total}
```
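
Typical usage:

```python
report = validate_jsonl("training_data.jsonl")
print(f"{report['valid']}/{report['total']} examples valid")
for err in report["errors"][:10]:  # show the first few problems
    print(err)
```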

### Deduplication (MinHash)

```python
from datasketch import MinHash, MinHashLSH

def deduplicate_dataset(examples, threshold=0.8):
    """Remove near-duplicate examples using MinHash LSH."""
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    unique = []

    for i, ex in enumerate(examples):
        # Hash the assistant's response (training target)
        text = ex["messages"][-1]["content"]
        m = MinHash(num_perm=128)
        for word in text.lower().split():
            m.update(word.encode("utf-8"))

        key = f"doc-{i}"
        if not lsh.query(m):
            lsh.insert(key, m)
            unique.append(ex)

    removed = len(examples) - len(unique)
    print(f"Removed {removed} duplicates ({removed/len(examples)*100:.1f}%)")
    return unique
```

### Diversity Metrics

```python
from collections import Counter

def distinct_n(texts, n=2):
    """Calculate distinct-n metric (ratio of unique n-grams to total n-grams)."""
    total_ngrams = Counter()
    for text in texts:
        words = text.lower().split()
        ngrams = [tuple(words[i:i+n]) for i in range(len(words)-n+1)]
        total_ngrams.update(ngrams)
    if sum(total_ngrams.values()) == 0:
        return 0
    return len(total_ngrams) / sum(total_ngrams.values())

def dataset_diversity_report(examples):
    """Generate diversity metrics for a training dataset."""
    responses = [ex["messages"][-1]["content"] for ex in examples]
    inputs = [m["content"] for ex in examples for m in ex["messages"] if m["role"] == "user"]

    report = {
        "total_examples": len(examples),
        "avg_response_length": sum(len(r.split()) for r in responses) / len(responses),
        "avg_input_length": sum(len(i.split()) for i in inputs) / len(inputs),
        "distinct_1": distinct_n(responses, 1),
        "distinct_2": distinct_n(responses, 2),
        "distinct_3": distinct_n(responses, 3),
    }
    return report
```

## Train/Eval Split

```python
import random

def split_dataset(examples, eval_ratio=0.1, production_indices=None):
    """Split dataset into train/eval, keeping production data in eval for ground truth.

    Args:
        examples: List of training examples
        eval_ratio: Fraction of data for evaluation (default 10%)
        production_indices: Indices of real production examples (always go to eval)
    """
    production_indices = set(production_indices or [])
    synthetic = [ex for i, ex in enumerate(examples) if i not in production_indices]
    production = [ex for i, ex in enumerate(examples) if i in production_indices]

    # Production data goes to eval (ground truth)
    eval_set = list(production)

    # Fill remaining eval budget from synthetic
    remaining_eval = max(0, int(len(examples) * eval_ratio) - len(eval_set))
    random.shuffle(synthetic)
    eval_set.extend(synthetic[:remaining_eval])
    train_set = synthetic[remaining_eval:]

    print(f"Train: {len(train_set)}, Eval: {len(eval_set)} "
          f"({len(production)} production + {len(eval_set)-len(production)} synthetic)")
    return train_set, eval_set
```

## Common Issues

| Issue | Cause | Fix |
|-------|-------|-----|
| `JSONDecodeError` | Trailing commas or malformed JSON | Run `validate_jsonl()` and fix flagged lines |
| Tokenizer mismatch | Data tokenized for wrong model | Always use target model's tokenizer for length checks |
| Training loss doesn't decrease | Data too noisy or contradictory | Filter low-quality examples, check for duplicates |
| Model repeats training data | Overfitting on small dataset | Add more diverse examples, reduce epochs |
| Data leakage | Eval examples appear in training | Use `split_dataset()` with `production_indices` |
| Encoding errors | Non-UTF-8 characters | `text.encode('utf-8', errors='replace').decode('utf-8')` |
| Examples too long | Exceeds model context | Truncate or split long conversations, check tokenizer limits |

## Quick Start Checklist

1. **Identify data source**: Production logs (A), frontier distillation (B), or synthetic (C)
2. **Format to JSONL**: Standard chat format with messages array
3. **Validate**: Run `validate_jsonl()` on the output file
4. **Deduplicate**: Run MinHash dedup with 0.8 threshold
5. **Check diversity**: Run `dataset_diversity_report()`, aim for distinct-2 > 0.5
6. **Split**: 90/10 train/eval, production data in eval set
7. **Count tokens**: Verify no examples exceed the model's context window (see the sketch below)
8. **Proceed to training**: Load the `tinker` or `unsloth` skill for the next step