@synsci/cli-darwin-x64-baseline 1.1.73 → 1.1.74

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -0,0 +1,238 @@
1
+ ---
2
+ name: model-economics
3
+ description: Cost modeling and ROI analysis for specialized LLM development. Use when deciding whether to train a custom model, estimating total cost, or calculating break-even vs frontier APIs. Covers training costs, inference costs, and time-to-ROI projections.
4
+ version: 1.0.0
5
+ author: Synthetic Sciences
6
+ license: MIT
7
+ tags: [Economics, Cost Analysis, ROI, Break-Even, Training Cost, Inference Cost, Model Flywheel, Business Case]
8
+ dependencies: []
9
+ ---
10
+
11
+ # Model Economics: Cost Analysis for LLM Specialization
12
+
13
+ ## When to Use This Skill
14
+
15
+ Use this skill as the **first step** of any flywheel assessment to determine:
16
+ - Is it worth training a specialized model?
17
+ - What will the total investment be?
18
+ - When will the specialized model break even vs frontier APIs?
19
+ - What's the ongoing cost comparison?
20
+
21
+ ## Frontier API Cost Model
22
+
23
+ **Important**: Use `websearch` to get current pricing — these change frequently. Reference tiers as of early 2026:
24
+
25
+ ### Premium Tier ($25-168/M output tokens)
26
+ | Model | Input ($/M) | Output ($/M) | Notes |
27
+ |-------|-------------|--------------|-------|
28
+ | Claude Opus 4.6 | $5.00 | $25.00 | Highest capability |
29
+ | GPT-5.2 Pro | $21.00 | $168.00 | Reasoning-heavy |
30
+
31
+ **Savings potential**: Highest. A specialized 8B model costs ~$0.05-0.15/M tokens to serve = 100-500x cheaper.
32
+
33
+ ### Standard Tier ($6-15/M output tokens)
34
+ | Model | Input ($/M) | Output ($/M) | Notes |
35
+ |-------|-------------|--------------|-------|
36
+ | GPT-5.2 Chat | $1.75 | $14.00 | Most popular |
37
+ | Claude Sonnet 4.5 | $3.00 | $15.00 | Good balance |
38
+ | Qwen3 Max | $1.20 | $6.00 | Cost-effective |
39
+
40
+ **Savings potential**: Strong at volume. Break-even is achievable with moderate traffic.
41
+
42
+ ### Budget Tier ($0.30-3/M output tokens)
43
+ | Model | Input ($/M) | Output ($/M) | Notes |
44
+ |-------|-------------|--------------|-------|
45
+ | Gemini 3 Flash | $0.50 | $3.00 | Fast, cheap |
46
+ | GLM 4.7 | $0.40 | $1.50 | Very cost-effective |
47
+ | Mistral Small | $0.10 | $0.30 | Cheapest commercial |
48
+
49
+ **Savings potential**: Harder to beat on cost alone. Specialize for quality, not cost.
50
+
51
+ ### Free / Near-Free
52
+ Many open-weight models are available on OpenRouter at $0. If you only need baseline capability, self-hosting one costs nothing beyond GPU time.
53
+
54
+ ## Training Cost Estimation
55
+
56
+ ### Tinker (Managed LoRA)
57
+ Cheapest option for supported models. No GPU management.
58
+
59
+ | Model Size | Method | Cost/1K examples | Duration |
60
+ |------------|--------|-------------------|----------|
61
+ | 8B | LoRA | ~$2-5 | 15-30 min |
62
+ | 14B | LoRA | ~$5-10 | 30-60 min |
63
+ | 32B | LoRA | ~$10-25 | 1-3 hours |
64
+ | 70B | LoRA | ~$25-60 | 3-8 hours |
65
+
66
+ Load `tinker-training-cost` skill for exact per-token rates.
67
+
68
+ ### Modal (Serverless GPU)
69
+
70
+ | GPU | Cost/Hour | Best For |
71
+ |-----|-----------|----------|
72
+ | A100-40GB | ~$1.10 | 7-14B LoRA, 7B full |
73
+ | A100-80GB | ~$1.60 | 14-32B LoRA, 14B full |
74
+ | H100-80GB | ~$3.25 | 32-70B LoRA, large batches |
75
+
76
+ **Typical training runs** (the arithmetic is sketched after this list):
77
+ - 8B LoRA, 5K examples, 3 epochs: ~1-2 hours on A100 = $1-3
78
+ - 14B LoRA, 10K examples, 3 epochs: ~3-5 hours on A100-80GB = $5-8
79
+ - 70B LoRA, 10K examples, 2 epochs: ~8-12 hours on H100 = $26-39
80
+
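+ The estimates above are just GPU hours times the hourly rate; a minimal sketch (function name is illustrative):
+
+ ```python
+ def training_run_cost(gpu_rate_per_hour: float, hours: float) -> float:
+     """Estimated cost of one training run: GPU hours x hourly rate."""
+     return gpu_rate_per_hour * hours
+
+ # 8B LoRA, 1-2 hours on an A100-40GB at ~$1.10/hr:
+ print(training_run_cost(1.10, 1), training_run_cost(1.10, 2))  # ~$1.10 to ~$2.20
+ ```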
81
+ ### TensorPool (Dedicated GPU)
82
+
83
+ | GPU | Cost/Hour | Best For |
84
+ |-----|-----------|----------|
85
+ | A100-80GB | ~$1.50 | Standard training |
86
+ | H200-141GB | ~$3.00 | Large models, fast training |
87
+ | B200-192GB | ~$4.50 | Cutting-edge, maximum throughput |
88
+
89
+ Best for: Multi-day training, large-scale RL, dedicated inference.
90
+
91
+ ## Inference Cost Comparison
92
+
93
+ The key calculation: what does it cost to serve the specialized model vs the frontier API?
94
+
95
+ ### Self-Hosted Inference Cost
96
+
97
+ ```python
98
+ def inference_cost_per_million_tokens(gpu_cost_per_hour, tokens_per_second):
99
+ """Calculate cost per million tokens for self-hosted inference.
100
+
101
+ Args:
102
+ gpu_cost_per_hour: GPU rental cost (e.g., $3.25 for H100)
103
+ tokens_per_second: Model throughput (varies by model size and batch)
104
+ """
105
+ tokens_per_hour = tokens_per_second * 3600
106
+ cost_per_token = gpu_cost_per_hour / tokens_per_hour
107
+ cost_per_million = cost_per_token * 1_000_000
108
+ return cost_per_million
109
+ ```
110
+
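+ For example, an H100-80GB at ~$3.25/hr sustaining ~5,000 tokens/sec:
+
+ ```python
+ cost = inference_cost_per_million_tokens(gpu_cost_per_hour=3.25, tokens_per_second=5000)
+ print(f"${cost:.2f}/M tokens")  # ~$0.18/M, matching the 8B-on-H100 row below
+ ```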
111
+ ### Reference Throughput (vLLM, single GPU)
112
+
113
+ | Model Size | GPU | Tokens/sec (batch) | Cost/M tokens |
114
+ |------------|-----|-------------------|---------------|
115
+ | 8B | A100-80GB | ~3,000-5,000 | $0.09-0.15 |
116
+ | 8B | H100-80GB | ~5,000-8,000 | $0.11-0.18 |
117
+ | 14B | A100-80GB | ~1,500-3,000 | $0.15-0.30 |
118
+ | 14B | H100-80GB | ~3,000-5,000 | $0.18-0.30 |
119
+ | 32B | H100-80GB | ~800-1,500 | $0.60-1.10 |
120
+ | 70B | 2xH100 | ~500-1,000 | $1.80-3.60 |
121
+
122
+ **Key insight**: A specialized 8B model on Modal/TensorPool costs **$0.05-0.15/M tokens** vs **$5-25/M** for premium frontier = **50-500x cheaper per request**.
123
+
124
+ ## Break-Even Calculator
125
+
126
+ ```python
127
+ def break_even_analysis(
128
+ monthly_api_spend: float,
129
+ training_cost: float,
130
+ monthly_inference_cost: float,
131
+ setup_hours: float = 40,
132
+ hourly_rate: float = 100,
133
+ ):
134
+ """Calculate break-even timeline for model specialization.
135
+
136
+ Args:
137
+ monthly_api_spend: Current monthly frontier API cost
138
+ training_cost: One-time training cost (GPU + data prep)
139
+ monthly_inference_cost: Monthly cost to serve specialized model
140
+ setup_hours: Engineering time to set up pipeline
141
+ hourly_rate: Engineering cost per hour
142
+ """
143
+ engineering_cost = setup_hours * hourly_rate
144
+ total_upfront = training_cost + engineering_cost
145
+ monthly_savings = monthly_api_spend - monthly_inference_cost
146
+
147
+ if monthly_savings <= 0:
148
+ return {
149
+ "viable": False,
150
+ "reason": "Specialized model costs more to serve than frontier API",
151
+ "monthly_savings": monthly_savings,
152
+ }
153
+
154
+ months_to_break_even = total_upfront / monthly_savings
155
+
156
+ return {
157
+ "viable": True,
158
+ "total_upfront_cost": total_upfront,
159
+ "training_cost": training_cost,
160
+ "engineering_cost": engineering_cost,
161
+ "monthly_api_spend": monthly_api_spend,
162
+ "monthly_inference_cost": monthly_inference_cost,
163
+ "monthly_savings": monthly_savings,
164
+ "months_to_break_even": round(months_to_break_even, 1),
165
+         "annual_savings": monthly_savings * 12 - total_upfront,  # year-1 net of upfront
166
+ "year_2_savings": monthly_savings * 12, # Fully amortized
167
+ }
168
+
169
+
170
+ # Example: Replace GPT-4o for a specific task
171
+ result = break_even_analysis(
172
+ monthly_api_spend=5000, # $5K/month on GPT-4o
173
+ training_cost=500, # $500 for LoRA fine-tune + data prep
174
+ monthly_inference_cost=200, # $200/month on Modal (8B model)
175
+ setup_hours=40, # 1 week of engineering
176
+ hourly_rate=100, # $100/hr engineer
177
+ )
178
+ # Result: break-even in ~0.9 months ($4,500 / $4,800), ~$53.1K net savings in year 1
179
+ ```
180
+
181
+ ## ROI Timeline Template
182
+
183
+ | Month | Frontier Cost | Specialized Cost | Cumulative Savings | Notes |
184
+ |-------|--------------|-----------------|-------------------|-------|
185
+ | 0 | — | $4,500 upfront | -$4,500 | Training + engineering |
186
+ | 1 | $5,000 | $200 | $300 | First month live; break-even |
+ | 2 | $5,000 | $200 | $5,100 | |
+ | 3 | $5,000 | $200 | $9,900 | |
+ | 6 | $5,000 | $200 | $24,300 | |
+ | 12 | $5,000 | $300 | $53,000 | Includes retrain cost |
191
+
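+ A quick loop reproduces the cumulative column, assuming the month-12 line item is the usual $200 serving plus a $100 retrain:
+
+ ```python
+ upfront, api_cost, serve_cost = 4500, 5000, 200
+ cumulative = -upfront
+ for month in range(1, 13):
+     retrain = 100 if month == 12 else 0  # periodic retrain, per the table above
+     cumulative += api_cost - (serve_cost + retrain)
+     print(month, cumulative)  # month 1: 300, month 2: 5100, ..., month 12: 53000
+ ```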
192
+ ## Decision Framework
193
+
194
+ ### Strong Case for Specialization
195
+ - Monthly API spend > $1,000
196
+ - Constrained task (classification, extraction, formatting, domain Q&A)
197
+ - Production data available (logs, corrections, accept/reject signals)
198
+ - Quality requirements are well-defined and measurable
199
+ - Task doesn't change frequently
200
+
201
+ ### Weak Case for Specialization
202
+ - Monthly API spend < $500
203
+ - Diverse, open-ended tasks (general assistant)
204
+ - No production data to train on
205
+ - Rapid model improvement expected (new frontier models releasing soon)
206
+ - Task requires reasoning at the frontier level
207
+ - Small team with no ML engineering capacity
208
+
209
+ ### Hybrid Approach
210
+ Route easy requests to specialized model, hard ones to frontier:
211
+ - **90% of requests** → Specialized 8B (cheap, fast)
212
+ - **10% of requests** → Frontier fallback (complex, novel)
213
+ - **Result**: 80-90% cost reduction while maintaining quality on hard cases (blended-cost sketch below)
214
+
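+ A blended-cost sketch, with illustrative prices ($0.10/M specialized, $10/M frontier):
+
+ ```python
+ def blended_cost_per_million(spec_share=0.9, spec_cost=0.10, frontier_cost=10.00):
+     """Average $/M tokens when spec_share of traffic goes to the specialized model."""
+     return spec_share * spec_cost + (1 - spec_share) * frontier_cost
+
+ print(blended_cost_per_million())  # 1.09 => ~89% cheaper than frontier-only
+ ```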
215
+ ## Quick Assessment Questions
216
+
217
+ Ask the user these to determine viability:
218
+
219
+ 1. **What's your monthly frontier API spend?** (thresholds encoded in the helper sketch after these questions)
220
+ - < $500: Probably not worth it (low ROI)
221
+ - $500-2K: Worth investigating
222
+ - $2K+: Strong candidate
223
+ - $10K+: Almost certainly worth it
224
+
225
+ 2. **How constrained is the task?**
226
+ - Single task (classification, extraction): Great fit
227
+ - Few related tasks: Good fit
228
+ - Many diverse tasks: Harder to specialize
229
+
230
+ 3. **What production data do you have?**
231
+ - API logs with user feedback: Excellent
232
+ - API logs without feedback: Good (can distill)
233
+ - No production data: Need synthetic bootstrap
234
+
235
+ 4. **What's your quality bar?**
236
+ - Must match frontier exactly: Harder, may need bigger model
237
+ - 90% of frontier quality is fine: 8B LoRA likely sufficient
238
+ - Task-specific metrics (accuracy, format): Easiest to optimize
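+
+ The spend thresholds from question 1, encoded as a quick helper (name and return strings are illustrative):
+
+ ```python
+ def spend_signal(monthly_api_spend: float) -> str:
+     """Map monthly frontier API spend to a rough viability signal."""
+     if monthly_api_spend < 500:
+         return "probably not worth it (low ROI)"
+     if monthly_api_spend < 2_000:
+         return "worth investigating"
+     if monthly_api_spend < 10_000:
+         return "strong candidate"
+     return "almost certainly worth it"
+ ```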
@@ -0,0 +1,427 @@
1
+ ---
2
+ name: training-data-pipeline
3
+ description: Build training datasets for LLM specialization from production data, frontier model distillation, and synthetic bootstrapping. Use when formatting production logs into SFT data, distilling from frontier APIs, or preparing data for fine-tuning. Covers JSONL formatting, data quality validation, deduplication, and train/eval splitting.
4
+ version: 1.0.0
5
+ author: Synthetic Sciences
6
+ license: MIT
7
+ tags: [Training Data, Data Pipeline, Fine-Tuning, Distillation, Synthetic Data, Production Data, JSONL, Data Quality]
8
+ dependencies: [datasets, transformers, openai]
9
+ ---
10
+
11
+ # Training Data Pipeline
12
+
13
+ ## When to Use This Skill
14
+
15
+ Use this skill when you need to:
16
+ - **Format production logs** into SFT training data (API logs, user corrections, accept/reject signals)
17
+ - **Distill from frontier models** using batch APIs (OpenAI, Anthropic) to label production inputs
18
+ - **Bootstrap synthetic data** when fewer than 1000 real examples exist
19
+ - **Validate data quality** before training (dedup, schema check, diversity metrics)
20
+ - **Split data** into train/eval sets with production data reserved for evaluation
21
+
22
+ ### Three Data Paths
23
+
24
+ | Path | When to Use | Data Source | Cost |
25
+ |------|-------------|-------------|------|
26
+ | A) Production data | Have API logs or user feedback | Your own production systems | Free (already collected) |
27
+ | B) Frontier distillation | Have production inputs but no labels | OpenAI/Anthropic batch APIs | ~50% of real-time API cost |
28
+ | C) Synthetic bootstrap | < 1000 real examples | Frontier model generation | Varies by volume |
29
+
30
+ **Always prefer Path A** — production data is the moat competitors can't replicate.
31
+
32
+ ## JSONL Chat Format
33
+
34
+ All training platforms (Tinker, Unsloth, TRL, Axolotl) accept this standard chat format:
35
+
36
+ ```jsonl
37
+ {"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "What is 2+2?"}, {"role": "assistant", "content": "4"}]}
38
+ {"messages": [{"role": "user", "content": "Translate to French: Hello"}, {"role": "assistant", "content": "Bonjour"}]}
39
+ ```
40
+
41
+ ### Format Rules
42
+ - One JSON object per line, no trailing commas
43
+ - `messages` array with `role` and `content` fields
44
+ - Roles: `system` (optional, first only), `user`, `assistant` (alternating)
45
+ - Multi-turn: alternate user/assistant pairs within a single messages array
46
+ - UTF-8 encoding, no BOM
47
+ - `assistant` messages are the training targets — everything else is context
48
+
49
+ ### Platform-Specific Notes
50
+
51
+ **Tinker**: Standard chat format above. Max 32K tokens per example. System message optional.
52
+
53
+ **Unsloth/TRL**: Same format. Also accepts `{"prompt": "...", "completion": "..."}` for simple pairs. Chat format preferred for multi-turn.
54
+
55
+ **Axolotl**: Supports multiple formats via config. Recommend `chat_template` type with standard JSONL.
56
+
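+ If your data is already in the simple pair form, converting to the chat format is a few lines (a sketch):
+
+ ```python
+ def pair_to_chat(pair):
+     """Convert a {"prompt", "completion"} pair to the standard chat format."""
+     return {"messages": [
+         {"role": "user", "content": pair["prompt"]},
+         {"role": "assistant", "content": pair["completion"]},
+     ]}
+ ```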
57
+ ## Path A: Production Data Collection
58
+
59
+ ### From API Logs
60
+
61
+ If you log API requests/responses, convert them directly:
62
+
63
+ ```python
64
+ import json
65
+
66
+ def api_log_to_training(log_entry):
67
+ """Convert an API request/response log to training format."""
68
+ messages = []
69
+
70
+ # Add system prompt if present
71
+ if log_entry.get("system_prompt"):
72
+ messages.append({
73
+ "role": "system",
74
+ "content": log_entry["system_prompt"]
75
+ })
76
+
77
+ # Add the user's input
78
+ messages.append({
79
+ "role": "user",
80
+ "content": log_entry["user_input"]
81
+ })
82
+
83
+ # Add the response (use corrected version if available)
84
+ response = log_entry.get("corrected_response") or log_entry["api_response"]
85
+ messages.append({
86
+ "role": "assistant",
87
+ "content": response
88
+ })
89
+
90
+ return {"messages": messages}
91
+
92
+ # Process logs
93
+ with open("api_logs.jsonl") as f, open("training_data.jsonl", "w") as out:
94
+ for line in f:
95
+ log = json.loads(line)
96
+ example = api_log_to_training(log)
97
+ out.write(json.dumps(example) + "\n")
98
+ ```
99
+
100
+ ### From User Corrections
101
+
102
+ User corrections (edits to model output) are the highest-quality training signal:
103
+
104
+ ```python
105
+ def correction_to_training(original_input, corrected_output, system_prompt=None):
106
+ """Convert a user correction into a training example.
107
+ The corrected output becomes the training target."""
108
+ messages = []
109
+ if system_prompt:
110
+ messages.append({"role": "system", "content": system_prompt})
111
+ messages.append({"role": "user", "content": original_input})
112
+ messages.append({"role": "assistant", "content": corrected_output})
113
+ return {"messages": messages}
114
+ ```
115
+
116
+ ### From Accept/Reject Signals
117
+
118
+ If users accept or reject model outputs, use accepted outputs as positive examples:
119
+
120
+ ```python
121
+ def filter_accepted(logs):
122
+ """Keep only examples where user accepted the output."""
123
+ accepted = []
124
+ for log in logs:
125
+ if log.get("user_action") == "accepted":
126
+ accepted.append({
127
+ "messages": [
128
+ {"role": "user", "content": log["input"]},
129
+ {"role": "assistant", "content": log["output"]}
130
+ ]
131
+ })
132
+ return accepted
133
+ ```
134
+
135
+ ## Path B: Frontier Distillation
136
+
137
+ Use frontier models to label your production inputs. Best when you have real inputs but no gold labels.
138
+
139
+ ### OpenAI Batch API (50% discount)
140
+
141
+ ```python
142
+ import json
143
+
144
+ def create_batch_file(inputs, system_prompt, model="gpt-4o"):
145
+ """Create a batch file for OpenAI Batch API."""
146
+ requests = []
147
+ for i, user_input in enumerate(inputs):
148
+ requests.append({
149
+ "custom_id": f"request-{i}",
150
+ "method": "POST",
151
+ "url": "/v1/chat/completions",
152
+ "body": {
153
+ "model": model,
154
+ "messages": [
155
+ {"role": "system", "content": system_prompt},
156
+ {"role": "user", "content": user_input}
157
+ ],
158
+ "max_tokens": 4096
159
+ }
160
+ })
161
+
162
+ with open("batch_input.jsonl", "w") as f:
163
+ for req in requests:
164
+ f.write(json.dumps(req) + "\n")
165
+ return "batch_input.jsonl"
166
+
167
+ # Submit the batch via the OpenAI SDK (see the sketch after this block)
169
+ ```
170
+
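+ A minimal submit-and-retrieve sketch with the OpenAI SDK (polling loop omitted):
+
+ ```python
+ from openai import OpenAI
+
+ client = OpenAI()
+
+ # Upload the batch input file and create the batch
+ batch_file = client.files.create(file=open("batch_input.jsonl", "rb"), purpose="batch")
+ batch = client.batches.create(
+     input_file_id=batch_file.id,
+     endpoint="/v1/chat/completions",
+     completion_window="24h",
+ )
+
+ # Later: check status and download results once completed
+ batch = client.batches.retrieve(batch.id)
+ if batch.status == "completed":
+     content = client.files.content(batch.output_file_id)
+     with open("batch_results.jsonl", "w") as f:
+         f.write(content.text)
+ ```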
171
+ ### Anthropic Batch API
172
+
173
+ ```python
174
+ import anthropic
175
+
176
+ client = anthropic.Anthropic()
177
+
178
+ def create_anthropic_batch(inputs, system_prompt, model="claude-sonnet-4-5-20250929"):
179
+ """Create batch request for Anthropic Message Batches API."""
180
+ requests = []
181
+ for i, user_input in enumerate(inputs):
182
+ requests.append({
183
+ "custom_id": f"request-{i}",
184
+ "params": {
185
+ "model": model,
186
+ "max_tokens": 4096,
187
+ "system": system_prompt,
188
+ "messages": [
189
+ {"role": "user", "content": user_input}
190
+ ]
191
+ }
192
+ })
193
+
194
+ batch = client.messages.batches.create(requests=requests)
195
+ return batch.id
196
+ ```
197
+
198
+ ### Processing Batch Results
199
+
200
+ ```python
201
+ import json
+
+ def batch_results_to_training(results_file, inputs, system_prompt=None):
+     """Convert OpenAI Batch API results into training JSONL.
+     (Anthropic batch results have a different shape; adjust the extraction below.)"""
203
+ training = []
204
+ with open(results_file) as f:
205
+ for line in f:
206
+ result = json.loads(line)
207
+ idx = int(result["custom_id"].split("-")[1])
208
+ messages = []
209
+ if system_prompt:
210
+ messages.append({"role": "system", "content": system_prompt})
211
+ messages.append({"role": "user", "content": inputs[idx]})
212
+ # Extract assistant response from batch result
213
+ content = result["response"]["body"]["choices"][0]["message"]["content"]
214
+ messages.append({"role": "assistant", "content": content})
215
+ training.append({"messages": messages})
216
+
217
+ with open("distilled_training.jsonl", "w") as f:
218
+ for example in training:
219
+ f.write(json.dumps(example) + "\n")
220
+ return len(training)
221
+ ```
222
+
223
+ ## Path C: Synthetic Bootstrapping
224
+
225
+ Generate training data from scratch when you have < 1000 real examples. Use as a starting point, then replace with production data as it accumulates.
226
+
227
+ ### Seed Prompt Strategy
228
+
229
+ ```python
230
+ import json
+
+ import openai
231
+
232
+ client = openai.OpenAI()
233
+
234
+ def generate_synthetic_examples(task_description, seed_examples, n=500, model="gpt-4o"):
235
+ """Generate diverse synthetic training examples from seed examples."""
236
+
237
+ meta_prompt = f"""You are generating training data for an LLM that will be fine-tuned for:
238
+ {task_description}
239
+
240
+ Here are {len(seed_examples[:5])} real examples of the desired behavior:
241
+ {json.dumps(seed_examples[:5], indent=2)}
242
+
243
+ Generate a NEW, diverse example. The input should cover a different scenario than
244
+ the seeds. The output should match the quality and style of the examples above.
245
+
246
+ Return JSON: {{"input": "...", "output": "..."}}"""
247
+
248
+ examples = []
249
+ for i in range(n):
250
+ response = client.chat.completions.create(
251
+ model=model,
252
+ messages=[{"role": "user", "content": meta_prompt}],
253
+ response_format={"type": "json_object"},
254
+ temperature=0.9, # High temp for diversity
255
+ )
256
+ example = json.loads(response.choices[0].message.content)
257
+ examples.append({
258
+ "messages": [
259
+ {"role": "user", "content": example["input"]},
260
+ {"role": "assistant", "content": example["output"]}
261
+ ]
262
+ })
263
+ return examples
264
+ ```
265
+
266
+ ### Diversity Strategies
267
+ - Vary temperature (0.7-1.0) across generation batches (sketched after this list)
268
+ - Use different frontier models (GPT-4o, Claude, Gemini) to reduce model-specific bias
269
+ - Seed with representative prompts from different categories/difficulty levels
270
+ - Include edge cases and adversarial examples explicitly in seed prompts
271
+
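+ A sketch of the first strategy, assuming `generate_synthetic_examples()` is extended with a `temperature` parameter (hypothetical signature):
+
+ ```python
+ all_examples = []
+ for temp in (0.7, 0.8, 0.9, 1.0):
+     # hypothetical `temperature` argument threaded through to the API call
+     all_examples += generate_synthetic_examples(task_description, seed_examples, n=125, model="gpt-4o", temperature=temp)
+ # Swap in other providers' models (via their own SDKs) to reduce model-specific bias
+ ```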
272
+ ## Data Quality Validation
273
+
274
+ ### Schema Validation
275
+
276
+ ```python
277
+ import json
+
+ def validate_jsonl(filepath):
278
+ """Validate JSONL training file format."""
279
+ errors = []
280
+ valid = 0
281
+ with open(filepath) as f:
282
+ for i, line in enumerate(f, 1):
283
+ try:
284
+ obj = json.loads(line)
285
+ except json.JSONDecodeError as e:
286
+ errors.append(f"Line {i}: Invalid JSON — {e}")
287
+ continue
288
+
289
+ if "messages" not in obj:
290
+ errors.append(f"Line {i}: Missing 'messages' key")
291
+ continue
292
+
293
+ msgs = obj["messages"]
294
+ if not isinstance(msgs, list) or len(msgs) < 2:
295
+ errors.append(f"Line {i}: 'messages' must be a list with >= 2 entries")
296
+ continue
297
+
298
+ # Check roles
299
+ has_user = any(m.get("role") == "user" for m in msgs)
300
+ has_assistant = any(m.get("role") == "assistant" for m in msgs)
301
+ if not has_user or not has_assistant:
302
+ errors.append(f"Line {i}: Must have at least one user and one assistant message")
303
+ continue
304
+
305
+             # Stop at the first bad message so each invalid line yields exactly one error
+             msg_error = None
+             for j, msg in enumerate(msgs):
+                 if "role" not in msg or "content" not in msg:
+                     msg_error = f"Line {i}, message {j}: Missing 'role' or 'content'"
+                     break
+                 if msg["role"] not in ("system", "user", "assistant"):
+                     msg_error = f"Line {i}, message {j}: Invalid role '{msg['role']}'"
+                     break
+             if msg_error:
+                 errors.append(msg_error)
+                 continue
+
+             valid += 1
312
+
313
+ return {"valid": valid, "errors": errors, "total": valid + len(errors)}
314
+ ```
315
+
316
+ ### Deduplication (MinHash)
317
+
318
+ ```python
319
+ from datasketch import MinHash, MinHashLSH
320
+
321
+ def deduplicate_dataset(examples, threshold=0.8):
322
+ """Remove near-duplicate examples using MinHash LSH."""
323
+ lsh = MinHashLSH(threshold=threshold, num_perm=128)
324
+ unique = []
325
+
326
+ for i, ex in enumerate(examples):
327
+ # Hash the assistant's response (training target)
328
+ text = ex["messages"][-1]["content"]
329
+ m = MinHash(num_perm=128)
330
+ for word in text.lower().split():
331
+ m.update(word.encode("utf-8"))
332
+
333
+ key = f"doc-{i}"
334
+ if not lsh.query(m):
335
+ lsh.insert(key, m)
336
+ unique.append(ex)
337
+
338
+ removed = len(examples) - len(unique)
339
+ print(f"Removed {removed} duplicates ({removed/len(examples)*100:.1f}%)")
340
+ return unique
341
+ ```
342
+
343
+ ### Diversity Metrics
344
+
345
+ ```python
346
+ from collections import Counter
347
+
348
+ def distinct_n(texts, n=2):
349
+ """Calculate distinct-n metric (ratio of unique n-grams to total n-grams)."""
350
+ total_ngrams = Counter()
351
+ for text in texts:
352
+ words = text.lower().split()
353
+ ngrams = [tuple(words[i:i+n]) for i in range(len(words)-n+1)]
354
+ total_ngrams.update(ngrams)
355
+ if sum(total_ngrams.values()) == 0:
356
+ return 0
357
+ return len(total_ngrams) / sum(total_ngrams.values())
358
+
359
+ def dataset_diversity_report(examples):
360
+ """Generate diversity metrics for a training dataset."""
361
+ responses = [ex["messages"][-1]["content"] for ex in examples]
362
+ inputs = [m["content"] for ex in examples for m in ex["messages"] if m["role"] == "user"]
363
+
364
+ report = {
365
+ "total_examples": len(examples),
366
+ "avg_response_length": sum(len(r.split()) for r in responses) / len(responses),
367
+ "avg_input_length": sum(len(i.split()) for i in inputs) / len(inputs),
368
+ "distinct_1": distinct_n(responses, 1),
369
+ "distinct_2": distinct_n(responses, 2),
370
+ "distinct_3": distinct_n(responses, 3),
371
+ }
372
+ return report
373
+ ```
374
+
375
+ ## Train/Eval Split
376
+
377
+ ```python
378
+ import random
379
+
380
+ def split_dataset(examples, eval_ratio=0.1, production_indices=None):
381
+ """Split dataset into train/eval, keeping production data in eval for ground truth.
382
+
383
+ Args:
384
+ examples: List of training examples
385
+ eval_ratio: Fraction of data for evaluation (default 10%)
386
+ production_indices: Indices of real production examples (always go to eval)
387
+ """
388
+ production_indices = set(production_indices or [])
389
+ synthetic = [ex for i, ex in enumerate(examples) if i not in production_indices]
390
+ production = [ex for i, ex in enumerate(examples) if i in production_indices]
391
+
392
+ # Production data goes to eval (ground truth)
393
+ eval_set = list(production)
394
+
395
+ # Fill remaining eval budget from synthetic
396
+ remaining_eval = max(0, int(len(examples) * eval_ratio) - len(eval_set))
397
+ random.shuffle(synthetic)
398
+ eval_set.extend(synthetic[:remaining_eval])
399
+ train_set = synthetic[remaining_eval:]
400
+
401
+ print(f"Train: {len(train_set)}, Eval: {len(eval_set)} "
402
+ f"({len(production)} production + {len(eval_set)-len(production)} synthetic)")
403
+ return train_set, eval_set
404
+ ```
405
+
406
+ ## Common Issues
407
+
408
+ | Issue | Cause | Fix |
409
+ |-------|-------|-----|
410
+ | `JSONDecodeError` | Trailing commas or malformed JSON | Run `validate_jsonl()` and fix flagged lines |
411
+ | Tokenizer mismatch | Data tokenized for wrong model | Always use target model's tokenizer for length checks |
412
+ | Training loss doesn't decrease | Data too noisy or contradictory | Filter low-quality examples, check for duplicates |
413
+ | Model repeats training data | Overfitting on small dataset | Add more diverse examples, reduce epochs |
414
+ | Data leakage | Eval examples appear in training | Use `split_dataset()` with `production_indices` |
415
+ | Encoding errors | Non-UTF-8 characters | `text.encode('utf-8', errors='replace').decode('utf-8')` |
416
+ | Examples too long | Exceeds model context | Truncate or split long conversations, check tokenizer limits |
417
+
418
+ ## Quick Start Checklist
419
+
420
+ 1. **Identify data source**: Production logs (A), frontier distillation (B), or synthetic (C)
421
+ 2. **Format to JSONL**: Standard chat format with messages array
422
+ 3. **Validate**: Run `validate_jsonl()` on the output file
423
+ 4. **Deduplicate**: Run MinHash dedup with 0.8 threshold
424
+ 5. **Check diversity**: Run `dataset_diversity_report()`, aim for distinct-2 > 0.5
425
+ 6. **Split**: 90/10 train/eval, production data in eval set
426
+ 7. **Count tokens**: Verify no examples exceed the model's context window (see the sketch after this checklist)
427
+ 8. **Proceed to training**: Load `tinker` or `unsloth` skill for next step
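+
+ For step 7, a minimal length check using the target model's chat template (model name is a placeholder; Tinker's 32K cap from above as the default limit):
+
+ ```python
+ from transformers import AutoTokenizer
+
+ tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")  # placeholder target
+
+ def find_overlong(examples, max_tokens=32_768):
+     """Return (index, token_count) for examples exceeding the limit."""
+     overlong = []
+     for i, ex in enumerate(examples):
+         ids = tokenizer.apply_chat_template(ex["messages"], tokenize=True)
+         if len(ids) > max_tokens:
+             overlong.append((i, len(ids)))
+     return overlong
+ ```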