@synsci/cli-darwin-arm64 1.1.73 → 1.1.75

@@ -0,0 +1,169 @@
+ # Scoring Rubrics Reference
+
+ ## Overview
+
+ A good rubric is the difference between noisy and reliable LLM-as-judge evaluation.
+ This reference provides rubric templates for common evaluation scenarios.
+
+ ## Rubric Design Principles
+
+ 1. **Specific anchors**: Each score level must describe observable behavior, not vague quality
+ 2. **Independent dimensions**: Criteria should not overlap (avoid "quality" and "helpfulness")
+ 3. **Weighted dimensions**: Not all criteria matter equally — assign weights
+ 4. **Calibration examples**: Include 2-3 example responses with their expected scores
+ 5. **Task-aligned**: The rubric should match what your users actually care about
+
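+ Encoded as data, these principles become checkable. A minimal sketch of a rubric as a Python structure (the field names are illustrative, not an existing API):
+
+ ```python
+ # Hypothetical rubric structure following the principles above.
+ rubric = {
+     "dimensions": {
+         # Specific anchors: each score level describes observable behavior.
+         "accuracy": {
+             "weight": 0.4,
+             "anchors": {1: "Contains factual errors", 5: "Perfectly accurate with caveats"},
+         },
+         "relevance": {
+             "weight": 0.6,
+             "anchors": {1: "Does not address the question", 5: "Addresses every aspect"},
+         },
+     },
+     # Calibration examples: responses with known expected scores.
+     "calibration": [
+         {"response": "Paris is the capital of France.", "expected": {"accuracy": 5, "relevance": 4}},
+     ],
+ }
+
+ # Weighted dimensions must sum to 1.0.
+ assert abs(sum(d["weight"] for d in rubric["dimensions"].values()) - 1.0) < 1e-9
+ ```
+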
+ ## Template: General Quality
+
+ ```
+ Dimensions (all weighted equally unless specified):
+
+ 1. Accuracy
+    1: Contains factual errors or hallucinations
+    2: Mostly correct but with notable inaccuracies
+    3: Factually correct on main points, minor issues
+    4: Accurate and well-supported claims
+    5: Perfectly accurate with appropriate caveats
+
+ 2. Relevance
+    1: Does not address the user's question
+    2: Partially relevant, misses key aspects
+    3: Addresses the main question adequately
+    4: Comprehensive coverage of the topic
+    5: Precisely addresses every aspect of the question
+
+ 3. Clarity
+    1: Confusing, poorly organized
+    2: Understandable but hard to follow
+    3: Clear and logically organized
+    4: Well-structured with good flow
+    5: Exceptionally clear, easy to scan and understand
+
+ 4. Conciseness
+    1: Extremely verbose, buries the answer
+    2: Contains significant unnecessary content
+    3: Appropriate length for the question
+    4: Efficiently communicated
+    5: Optimal length — every word serves a purpose
+ ```
+
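+ To use a template like this, embed it in the judge's prompt and ask for one score per dimension. A minimal sketch, assuming a hypothetical `call_judge` wrapper around whatever LLM client you use:
+
+ ```python
+ import json
+
+ RUBRIC = "..."  # paste the General Quality template above
+
+ def build_judge_prompt(question: str, response: str) -> str:
+     """Assemble a judge prompt that requests machine-readable scores."""
+     return (
+         "Score the response on each rubric dimension from 1 to 5.\n\n"
+         f"Rubric:\n{RUBRIC}\n\n"
+         f"Question:\n{question}\n\nResponse:\n{response}\n\n"
+         'Reply with JSON only: {"accuracy": n, "relevance": n, "clarity": n, "conciseness": n}'
+     )
+
+ # scores = json.loads(call_judge(build_judge_prompt(q, r)))  # call_judge is hypothetical
+ ```
+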
+ ## Template: Code Generation
+
+ ```
+ Dimensions:
+
+ 1. Correctness (weight: 0.40)
+    1: Won't compile/run, fundamental logic errors
+    2: Runs but fails on basic test cases
+    3: Handles common cases correctly
+    4: Handles edge cases, good error handling
+    5: Correct, robust, handles all specified requirements
+
+ 2. Code Quality (weight: 0.25)
+    1: Unreadable, no structure
+    2: Poor naming, minimal structure
+    3: Acceptable style, reasonable naming
+    4: Clean, well-organized, follows conventions
+    5: Exemplary code that teaches best practices
+
+ 3. Efficiency (weight: 0.15)
+    1: Exponential complexity or worse
+    2: Unnecessarily slow (wrong algorithm choice)
+    3: Acceptable for typical input sizes
+    4: Well-optimized, appropriate algorithms
+    5: Optimal or near-optimal solution
+
+ 4. Completeness (weight: 0.20)
+    1: Missing major requirements
+    2: Partial implementation, key gaps
+    3: Core requirements met
+    4: Complete with error handling
+    5: Complete with tests, docs, and error handling
+ ```
+
+ ## Template: Customer Support
+
+ ```
+ Dimensions:
+
+ 1. Problem Resolution (weight: 0.40)
+    1: Does not address the customer's issue
+    2: Acknowledges issue but provides wrong solution
+    3: Provides a valid solution that may not be optimal
+    4: Provides the best available solution
+    5: Resolves issue and proactively prevents related problems
+
+ 2. Tone & Empathy (weight: 0.25)
+    1: Rude, dismissive, or robotic
+    2: Professional but cold
+    3: Friendly and professional
+    4: Warm, empathetic, personalized
+    5: Exceptional rapport while maintaining professionalism
+
+ 3. Accuracy (weight: 0.20)
+    1: Contains incorrect information about products/policies
+    2: Mostly correct with some errors
+    3: Factually accurate
+    4: Accurate with helpful additional context
+    5: Perfectly accurate with relevant links/resources
+
+ 4. Efficiency (weight: 0.15)
+    1: Requires multiple follow-ups for basic resolution
+    2: Could be more direct
+    3: Reasonable number of steps to resolution
+    4: Efficient resolution path
+    5: Resolves in minimum possible interactions
+ ```
+
+ ## Template: Summarization
+
+ ```
+ Dimensions:
+
+ 1. Faithfulness (weight: 0.35)
+    1: Contains hallucinated information not in source
+    2: Mostly faithful but adds unsupported claims
+    3: Faithful to source material
+    4: Accurately represents source with proper nuance
+    5: Perfectly faithful, captures nuance and caveats
+
+ 2. Coverage (weight: 0.30)
+    1: Misses most key points
+    2: Captures some key points, misses important ones
+    3: Covers main points adequately
+    4: Comprehensive coverage of key information
+    5: Captures all important points and relationships
+
+ 3. Coherence (weight: 0.20)
+    1: Disjointed, hard to follow
+    2: Some logical flow issues
+    3: Reads smoothly
+    4: Well-organized with clear structure
+    5: Exemplary narrative flow
+
+ 4. Conciseness (weight: 0.15)
+    1: As long as original (no compression)
+    2: Minimal compression, includes unnecessary details
+    3: Reasonable length reduction
+    4: Well-compressed, only essential information
+    5: Maximum information density, every word counts
+ ```
+
+ ## Composite Scoring
+
+ ```python
+ def weighted_score(scores, weights):
+     """Calculate a weighted composite score from dimension scores.
+
+     Args:
+         scores: dict of {"dimension": score} (each score 1-5)
+         weights: dict of {"dimension": weight} (weights must sum to 1.0)
+     """
+     total = sum(scores[dim] * weights[dim] for dim in scores)
+     return round(total, 2)
+
+ # Example
+ scores = {"correctness": 4, "quality": 3, "efficiency": 5, "completeness": 4}
+ weights = {"correctness": 0.4, "quality": 0.25, "efficiency": 0.15, "completeness": 0.2}
+ composite = weighted_score(scores, weights)  # 3.9
+ ```
@@ -0,0 +1,238 @@
+ ---
+ name: model-economics
+ description: Cost modeling and ROI analysis for specialized LLM development. Use when deciding whether to train a custom model, estimating total cost, or calculating break-even vs frontier APIs. Covers training costs, inference costs, and time-to-ROI projections.
+ version: 1.0.0
+ author: Synthetic Sciences
+ license: MIT
+ tags: [Economics, Cost Analysis, ROI, Break-Even, Training Cost, Inference Cost, Model Flywheel, Business Case]
+ dependencies: []
+ ---
+
+ # Model Economics: Cost Analysis for LLM Specialization
+
+ ## When to Use This Skill
+
+ Use this skill as the **first step** of any flywheel assessment to determine:
+ - Is it worth training a specialized model?
+ - What will the total investment be?
+ - When will the specialized model break even vs frontier APIs?
+ - What's the ongoing cost comparison?
+
+ ## Frontier API Cost Model
+
+ **Important**: Use `websearch` to get current pricing — these numbers change frequently. Reference tiers as of early 2026:
+
+ ### Premium Tier ($5-25/M output tokens)
+ | Model | Input ($/M) | Output ($/M) | Notes |
+ |-------|-------------|--------------|-------|
+ | Claude Opus 4.6 | $5.00 | $25.00 | Highest capability |
+ | GPT-5.2 Pro | $21.00 | $168.00 | Reasoning-heavy |
+
+ **Savings potential**: Highest. A specialized 8B model costs ~$0.05-0.15/M tokens to serve, roughly 50-500x cheaper.
+
+ ### Standard Tier ($1.75-14/M output tokens)
+ | Model | Input ($/M) | Output ($/M) | Notes |
+ |-------|-------------|--------------|-------|
+ | GPT-5.2 Chat | $1.75 | $14.00 | Most popular |
+ | Claude Sonnet 4.5 | $3.00 | $15.00 | Good balance |
+ | Qwen3 Max | $1.20 | $6.00 | Cost-effective |
+
+ **Savings potential**: Strong at volume. Break-even is achievable with moderate traffic.
+
+ ### Budget Tier ($0.10-3/M output tokens)
+ | Model | Input ($/M) | Output ($/M) | Notes |
+ |-------|-------------|--------------|-------|
+ | Gemini 3 Flash | $0.50 | $3.00 | Fast, cheap |
+ | GLM 4.7 | $0.40 | $1.50 | Very cost-effective |
+ | Mistral Small | $0.10 | $0.30 | Cheapest commercial |
+
+ **Savings potential**: Harder to beat on cost alone. Specialize for quality, not cost.
+
+ ### Free / Near-Free
+ Many open-weight models are available on OpenRouter at $0. If you only need basic capability, self-hosting these is essentially free beyond GPU cost.
+
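+ To compare tiers against your own traffic, monthly API spend is just token volume times price. A minimal sketch, using the reference prices above (substitute current prices from `websearch`):
+
+ ```python
+ def monthly_api_cost(input_m_tokens, output_m_tokens, in_price, out_price):
+     """Monthly spend given monthly token volume (in millions) and $/M prices."""
+     return input_m_tokens * in_price + output_m_tokens * out_price
+
+ # Example: 100M input + 20M output tokens/month on Claude Sonnet 4.5
+ print(monthly_api_cost(100, 20, 3.00, 15.00))  # $600/month
+ ```
+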
+ ## Training Cost Estimation
+
+ ### Tinker (Managed LoRA)
+ Cheapest option for supported models. No GPU management.
+
+ | Model Size | Method | Cost/1K examples | Duration |
+ |------------|--------|-------------------|----------|
+ | 8B | LoRA | ~$2-5 | 15-30 min |
+ | 14B | LoRA | ~$5-10 | 30-60 min |
+ | 32B | LoRA | ~$10-25 | 1-3 hours |
+ | 70B | LoRA | ~$25-60 | 3-8 hours |
+
+ Load the `tinker-training-cost` skill for exact per-token rates.
+
+ ### Modal (Serverless GPU)
+
+ | GPU | Cost/Hour | Best For |
+ |-----|-----------|----------|
+ | A100-40GB | ~$1.10 | 7-14B LoRA, 7B full |
+ | A100-80GB | ~$1.60 | 14-32B LoRA, 14B full |
+ | H100-80GB | ~$3.25 | 32-70B LoRA, large batches |
+
+ **Typical training runs**:
+ - 8B LoRA, 5K examples, 3 epochs: ~1-2 hours on A100 = $1-3
+ - 14B LoRA, 10K examples, 3 epochs: ~3-5 hours on A100-80GB = $5-8
+ - 70B LoRA, 10K examples, 2 epochs: ~8-12 hours on H100 = $26-39
+
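+ These run costs are simply GPU-hours times the hourly rate. A minimal estimator, where examples-per-hour throughput is an assumption you should calibrate with a short test run:
+
+ ```python
+ def training_cost(examples, epochs, examples_per_hour, gpu_cost_per_hour):
+     """Estimate GPU cost for a fine-tuning run from rough throughput."""
+     hours = (examples * epochs) / examples_per_hour
+     return hours, hours * gpu_cost_per_hour
+
+ # Example: 8B LoRA, 5K examples, 3 epochs, ~10K examples/hour (assumed), A100 at $1.10/hr
+ hours, cost = training_cost(5_000, 3, 10_000, 1.10)  # ~1.5 hours, ~$1.65
+ ```
+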
+ ### TensorPool (Dedicated GPU)
+
+ | GPU | Cost/Hour | Best For |
+ |-----|-----------|----------|
+ | A100-80GB | ~$1.50 | Standard training |
+ | H200-141GB | ~$3.00 | Large models, fast training |
+ | B200-192GB | ~$4.50 | Cutting-edge, maximum throughput |
+
+ Best for: Multi-day training, large-scale RL, dedicated inference.
+
+ ## Inference Cost Comparison
+
+ The key calculation: what does it cost to serve the specialized model vs the frontier API?
+
+ ### Self-Hosted Inference Cost
+
+ ```python
+ def inference_cost_per_million_tokens(gpu_cost_per_hour, tokens_per_second):
+     """Calculate cost per million tokens for self-hosted inference.
+
+     Args:
+         gpu_cost_per_hour: GPU rental cost (e.g., $3.25 for H100)
+         tokens_per_second: Model throughput (varies by model size and batch)
+     """
+     tokens_per_hour = tokens_per_second * 3600
+     cost_per_token = gpu_cost_per_hour / tokens_per_hour
+     cost_per_million = cost_per_token * 1_000_000
+     return cost_per_million
+ ```
+
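+ A quick usage check, reproducing the 8B-on-H100 row of the reference table below:
+
+ ```python
+ # H100 at $3.25/hr serving an 8B model at ~5,000 tokens/sec (batched)
+ print(inference_cost_per_million_tokens(3.25, 5_000))  # ~$0.18/M tokens
+ ```
+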
+ ### Reference Throughput (vLLM, single GPU)
+
+ | Model Size | GPU | Tokens/sec (batch) | Cost/M tokens |
+ |------------|-----|-------------------|---------------|
+ | 8B | A100-80GB | ~3,000-5,000 | $0.09-0.15 |
+ | 8B | H100-80GB | ~5,000-8,000 | $0.11-0.18 |
+ | 14B | A100-80GB | ~1,500-3,000 | $0.15-0.30 |
+ | 14B | H100-80GB | ~3,000-5,000 | $0.18-0.30 |
+ | 32B | H100-80GB | ~800-1,500 | $0.60-1.10 |
+ | 70B | 2xH100 | ~500-1,000 | $1.80-3.60 |
+
+ **Key insight**: A specialized 8B model on Modal/TensorPool costs **$0.05-0.15/M tokens** vs **$5-25/M** for premium frontier = **50-500x cheaper per request**.
+
+ ## Break-Even Calculator
+
+ ```python
+ def break_even_analysis(
+     monthly_api_spend: float,
+     training_cost: float,
+     monthly_inference_cost: float,
+     setup_hours: float = 40,
+     hourly_rate: float = 100,
+ ):
+     """Calculate break-even timeline for model specialization.
+
+     Args:
+         monthly_api_spend: Current monthly frontier API cost
+         training_cost: One-time training cost (GPU + data prep)
+         monthly_inference_cost: Monthly cost to serve specialized model
+         setup_hours: Engineering time to set up pipeline
+         hourly_rate: Engineering cost per hour
+     """
+     engineering_cost = setup_hours * hourly_rate
+     total_upfront = training_cost + engineering_cost
+     monthly_savings = monthly_api_spend - monthly_inference_cost
+
+     if monthly_savings <= 0:
+         return {
+             "viable": False,
+             "reason": "Specialized model costs more to serve than frontier API",
+             "monthly_savings": monthly_savings,
+         }
+
+     months_to_break_even = total_upfront / monthly_savings
+
+     return {
+         "viable": True,
+         "total_upfront_cost": total_upfront,
+         "training_cost": training_cost,
+         "engineering_cost": engineering_cost,
+         "monthly_api_spend": monthly_api_spend,
+         "monthly_inference_cost": monthly_inference_cost,
+         "monthly_savings": monthly_savings,
+         "months_to_break_even": round(months_to_break_even, 1),
+         "annual_savings": monthly_savings * 12 - total_upfront,  # net of upfront cost
+         "year_2_savings": monthly_savings * 12,  # fully amortized
+     }
+
+
+ # Example: Replace GPT-4o for a specific task
+ result = break_even_analysis(
+     monthly_api_spend=5000,  # $5K/month on GPT-4o
+     training_cost=500,  # $500 for LoRA fine-tune + data prep
+     monthly_inference_cost=200,  # $200/month on Modal (8B model)
+     setup_hours=40,  # 1 week of engineering
+     hourly_rate=100,  # $100/hr engineer
+ )
+ # Result: break-even in ~0.9 months, ~$53.1K net savings in year 1
+ ```
+
+ ## ROI Timeline Template
+
+ | Month | Frontier Cost | Specialized Cost | Cumulative Savings | Notes |
+ |-------|--------------|-----------------|-------------------|-------|
+ | 0 | — | $4,500 upfront | -$4,500 | Training + engineering |
+ | 1 | $5,000 | $200 | $300 | First month live; break-even (~0.9 mo) |
+ | 2 | $5,000 | $200 | $5,100 | |
+ | 3 | $5,000 | $200 | $9,900 | |
+ | 6 | $5,000 | $200 | $24,300 | |
+ | 12 | $5,000 | $300 | $53,000 | Includes retrain cost |
+
+ ## Decision Framework
+
+ ### Strong Case for Specialization
+ - Monthly API spend > $1,000
+ - Constrained task (classification, extraction, formatting, domain Q&A)
+ - Production data available (logs, corrections, accept/reject signals)
+ - Quality requirements are well-defined and measurable
+ - Task doesn't change frequently
+
+ ### Weak Case for Specialization
+ - Monthly API spend < $500
+ - Diverse, open-ended tasks (general assistant)
+ - No production data to train on
+ - Rapid model improvement expected (new frontier models releasing soon)
+ - Task requires reasoning at the frontier level
+ - Small team with no ML engineering capacity
+
+ ### Hybrid Approach
+ Route easy requests to the specialized model, hard ones to the frontier (routing sketch below):
+ - **90% of requests** → Specialized 8B (cheap, fast)
+ - **10% of requests** → Frontier fallback (complex, novel)
+ - **Result**: 80-90% cost reduction while maintaining quality on hard cases
+
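+ A minimal routing sketch under stated assumptions: `is_hard` is a placeholder heuristic (confidence score, input length, novelty check), and `specialized`/`frontier` stand in for your two model clients:
+
+ ```python
+ def route(request, specialized, frontier, is_hard):
+     """Send easy requests to the specialized model, hard ones to the frontier."""
+     if is_hard(request):  # e.g., low classifier confidence, novel intent, long input
+         return frontier(request)
+     return specialized(request)
+
+ # With ~90% of traffic at ~$0.10/M output and 10% at ~$15/M output,
+ # blended cost is ~0.9*0.10 + 0.1*15 = ~$1.59/M vs $15/M: ~89% cheaper.
+ ```
+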
+ ## Quick Assessment Questions
+
+ Ask the user these to determine viability; a rough scoring sketch follows the list:
+
+ 1. **What's your monthly frontier API spend?**
+    - < $500: Probably not worth it (low ROI)
+    - $500-2K: Worth investigating
+    - $2K+: Strong candidate
+    - $10K+: Almost certainly worth it
+
+ 2. **How constrained is the task?**
+    - Single task (classification, extraction): Great fit
+    - Few related tasks: Good fit
+    - Many diverse tasks: Harder to specialize
+
+ 3. **What production data do you have?**
+    - API logs with user feedback: Excellent
+    - API logs without feedback: Good (can distill)
+    - No production data: Need synthetic bootstrap
+
+ 4. **What's your quality bar?**
+    - Must match frontier exactly: Harder, may need bigger model
+    - 90% of frontier quality is fine: 8B LoRA likely sufficient
+    - Task-specific metrics (accuracy, format): Easiest to optimize
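+
+ One rough way to turn the answers into a recommendation (the thresholds mirror the guidance above; the scoring function itself is an illustrative heuristic, not part of this skill):
+
+ ```python
+ def viability(monthly_spend, constrained_task, has_production_data, needs_frontier_parity):
+     """Map the four assessment answers to a coarse recommendation."""
+     if monthly_spend < 500:
+         return "skip: low ROI"
+     if needs_frontier_parity:
+         return "caution: may need a bigger model or hybrid routing"
+     if constrained_task and has_production_data:
+         return "strong candidate: run break_even_analysis()"
+     return "investigate: gather data or narrow the task first"
+
+ print(viability(5000, True, True, False))  # strong candidate: run break_even_analysis()
+ ```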