tech-hub-skills 1.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (133)
  1. package/LICENSE +21 -0
  2. package/README.md +250 -0
  3. package/bin/cli.js +241 -0
  4. package/bin/copilot.js +182 -0
  5. package/bin/postinstall.js +42 -0
  6. package/package.json +46 -0
  7. package/tech_hub_skills/roles/ai-engineer/skills/01-prompt-engineering/README.md +252 -0
  8. package/tech_hub_skills/roles/ai-engineer/skills/02-rag-pipeline/README.md +448 -0
  9. package/tech_hub_skills/roles/ai-engineer/skills/03-agent-orchestration/README.md +599 -0
  10. package/tech_hub_skills/roles/ai-engineer/skills/04-llm-guardrails/README.md +735 -0
  11. package/tech_hub_skills/roles/ai-engineer/skills/05-vector-embeddings/README.md +711 -0
  12. package/tech_hub_skills/roles/ai-engineer/skills/06-llm-evaluation/README.md +777 -0
  13. package/tech_hub_skills/roles/azure/skills/01-infrastructure-fundamentals/README.md +264 -0
  14. package/tech_hub_skills/roles/azure/skills/02-data-factory/README.md +264 -0
  15. package/tech_hub_skills/roles/azure/skills/03-synapse-analytics/README.md +264 -0
  16. package/tech_hub_skills/roles/azure/skills/04-databricks/README.md +264 -0
  17. package/tech_hub_skills/roles/azure/skills/05-functions/README.md +264 -0
  18. package/tech_hub_skills/roles/azure/skills/06-kubernetes-service/README.md +264 -0
  19. package/tech_hub_skills/roles/azure/skills/07-openai-service/README.md +264 -0
  20. package/tech_hub_skills/roles/azure/skills/08-machine-learning/README.md +264 -0
  21. package/tech_hub_skills/roles/azure/skills/09-storage-adls/README.md +264 -0
  22. package/tech_hub_skills/roles/azure/skills/10-networking/README.md +264 -0
  23. package/tech_hub_skills/roles/azure/skills/11-sql-cosmos/README.md +264 -0
  24. package/tech_hub_skills/roles/azure/skills/12-event-hubs/README.md +264 -0
  25. package/tech_hub_skills/roles/code-review/skills/01-automated-code-review/README.md +394 -0
  26. package/tech_hub_skills/roles/code-review/skills/02-pr-review-workflow/README.md +427 -0
  27. package/tech_hub_skills/roles/code-review/skills/03-code-quality-gates/README.md +518 -0
  28. package/tech_hub_skills/roles/code-review/skills/04-reviewer-assignment/README.md +504 -0
  29. package/tech_hub_skills/roles/code-review/skills/05-review-analytics/README.md +540 -0
  30. package/tech_hub_skills/roles/data-engineer/skills/01-lakehouse-architecture/README.md +550 -0
  31. package/tech_hub_skills/roles/data-engineer/skills/02-etl-pipeline/README.md +580 -0
  32. package/tech_hub_skills/roles/data-engineer/skills/03-data-quality/README.md +579 -0
  33. package/tech_hub_skills/roles/data-engineer/skills/04-streaming-pipelines/README.md +608 -0
  34. package/tech_hub_skills/roles/data-engineer/skills/05-performance-optimization/README.md +547 -0
  35. package/tech_hub_skills/roles/data-governance/skills/01-data-catalog/README.md +112 -0
  36. package/tech_hub_skills/roles/data-governance/skills/02-data-lineage/README.md +129 -0
  37. package/tech_hub_skills/roles/data-governance/skills/03-data-quality-framework/README.md +182 -0
  38. package/tech_hub_skills/roles/data-governance/skills/04-access-control/README.md +39 -0
  39. package/tech_hub_skills/roles/data-governance/skills/05-master-data-management/README.md +40 -0
  40. package/tech_hub_skills/roles/data-governance/skills/06-compliance-privacy/README.md +46 -0
  41. package/tech_hub_skills/roles/data-scientist/skills/01-eda-automation/README.md +230 -0
  42. package/tech_hub_skills/roles/data-scientist/skills/02-statistical-modeling/README.md +264 -0
  43. package/tech_hub_skills/roles/data-scientist/skills/03-feature-engineering/README.md +264 -0
  44. package/tech_hub_skills/roles/data-scientist/skills/04-predictive-modeling/README.md +264 -0
  45. package/tech_hub_skills/roles/data-scientist/skills/05-customer-analytics/README.md +264 -0
  46. package/tech_hub_skills/roles/data-scientist/skills/06-campaign-analysis/README.md +264 -0
  47. package/tech_hub_skills/roles/data-scientist/skills/07-experimentation/README.md +264 -0
  48. package/tech_hub_skills/roles/data-scientist/skills/08-data-visualization/README.md +264 -0
  49. package/tech_hub_skills/roles/devops/skills/01-cicd-pipeline/README.md +264 -0
  50. package/tech_hub_skills/roles/devops/skills/02-container-orchestration/README.md +264 -0
  51. package/tech_hub_skills/roles/devops/skills/03-infrastructure-as-code/README.md +264 -0
  52. package/tech_hub_skills/roles/devops/skills/04-gitops/README.md +264 -0
  53. package/tech_hub_skills/roles/devops/skills/05-environment-management/README.md +264 -0
  54. package/tech_hub_skills/roles/devops/skills/06-automated-testing/README.md +264 -0
  55. package/tech_hub_skills/roles/devops/skills/07-release-management/README.md +264 -0
  56. package/tech_hub_skills/roles/devops/skills/08-monitoring-alerting/README.md +264 -0
  57. package/tech_hub_skills/roles/devops/skills/09-devsecops/README.md +265 -0
  58. package/tech_hub_skills/roles/finops/skills/01-cost-visibility/README.md +264 -0
  59. package/tech_hub_skills/roles/finops/skills/02-resource-tagging/README.md +264 -0
  60. package/tech_hub_skills/roles/finops/skills/03-budget-management/README.md +264 -0
  61. package/tech_hub_skills/roles/finops/skills/04-reserved-instances/README.md +264 -0
  62. package/tech_hub_skills/roles/finops/skills/05-spot-optimization/README.md +264 -0
  63. package/tech_hub_skills/roles/finops/skills/06-storage-tiering/README.md +264 -0
  64. package/tech_hub_skills/roles/finops/skills/07-compute-rightsizing/README.md +264 -0
  65. package/tech_hub_skills/roles/finops/skills/08-chargeback/README.md +264 -0
  66. package/tech_hub_skills/roles/ml-engineer/skills/01-mlops-pipeline/README.md +566 -0
  67. package/tech_hub_skills/roles/ml-engineer/skills/02-feature-engineering/README.md +655 -0
  68. package/tech_hub_skills/roles/ml-engineer/skills/03-model-training/README.md +704 -0
  69. package/tech_hub_skills/roles/ml-engineer/skills/04-model-serving/README.md +845 -0
  70. package/tech_hub_skills/roles/ml-engineer/skills/05-model-monitoring/README.md +874 -0
  71. package/tech_hub_skills/roles/mlops/skills/01-ml-pipeline-orchestration/README.md +264 -0
  72. package/tech_hub_skills/roles/mlops/skills/02-experiment-tracking/README.md +264 -0
  73. package/tech_hub_skills/roles/mlops/skills/03-model-registry/README.md +264 -0
  74. package/tech_hub_skills/roles/mlops/skills/04-feature-store/README.md +264 -0
  75. package/tech_hub_skills/roles/mlops/skills/05-model-deployment/README.md +264 -0
  76. package/tech_hub_skills/roles/mlops/skills/06-model-observability/README.md +264 -0
  77. package/tech_hub_skills/roles/mlops/skills/07-data-versioning/README.md +264 -0
  78. package/tech_hub_skills/roles/mlops/skills/08-ab-testing/README.md +264 -0
  79. package/tech_hub_skills/roles/mlops/skills/09-automated-retraining/README.md +264 -0
  80. package/tech_hub_skills/roles/platform-engineer/skills/01-internal-developer-platform/README.md +153 -0
  81. package/tech_hub_skills/roles/platform-engineer/skills/02-self-service-infrastructure/README.md +57 -0
  82. package/tech_hub_skills/roles/platform-engineer/skills/03-slo-sli-management/README.md +59 -0
  83. package/tech_hub_skills/roles/platform-engineer/skills/04-developer-experience/README.md +57 -0
  84. package/tech_hub_skills/roles/platform-engineer/skills/05-incident-management/README.md +73 -0
  85. package/tech_hub_skills/roles/platform-engineer/skills/06-capacity-management/README.md +59 -0
  86. package/tech_hub_skills/roles/product-designer/skills/01-requirements-discovery/README.md +407 -0
  87. package/tech_hub_skills/roles/product-designer/skills/02-user-research/README.md +382 -0
  88. package/tech_hub_skills/roles/product-designer/skills/03-brainstorming-ideation/README.md +437 -0
  89. package/tech_hub_skills/roles/product-designer/skills/04-ux-design/README.md +496 -0
  90. package/tech_hub_skills/roles/product-designer/skills/05-product-market-fit/README.md +376 -0
  91. package/tech_hub_skills/roles/product-designer/skills/06-stakeholder-management/README.md +412 -0
  92. package/tech_hub_skills/roles/security-architect/skills/01-pii-detection/README.md +319 -0
  93. package/tech_hub_skills/roles/security-architect/skills/02-threat-modeling/README.md +264 -0
  94. package/tech_hub_skills/roles/security-architect/skills/03-infrastructure-security/README.md +264 -0
  95. package/tech_hub_skills/roles/security-architect/skills/04-iam/README.md +264 -0
  96. package/tech_hub_skills/roles/security-architect/skills/05-application-security/README.md +264 -0
  97. package/tech_hub_skills/roles/security-architect/skills/06-secrets-management/README.md +264 -0
  98. package/tech_hub_skills/roles/security-architect/skills/07-security-monitoring/README.md +264 -0
  99. package/tech_hub_skills/roles/system-design/skills/01-architecture-patterns/README.md +337 -0
  100. package/tech_hub_skills/roles/system-design/skills/02-requirements-engineering/README.md +264 -0
  101. package/tech_hub_skills/roles/system-design/skills/03-scalability/README.md +264 -0
  102. package/tech_hub_skills/roles/system-design/skills/04-high-availability/README.md +264 -0
  103. package/tech_hub_skills/roles/system-design/skills/05-cost-optimization-design/README.md +264 -0
  104. package/tech_hub_skills/roles/system-design/skills/06-api-design/README.md +264 -0
  105. package/tech_hub_skills/roles/system-design/skills/07-observability-architecture/README.md +264 -0
  106. package/tech_hub_skills/roles/system-design/skills/08-process-automation/PROCESS_TEMPLATE.md +336 -0
  107. package/tech_hub_skills/roles/system-design/skills/08-process-automation/README.md +521 -0
  108. package/tech_hub_skills/skills/README.md +336 -0
  109. package/tech_hub_skills/skills/ai-engineer.md +104 -0
  110. package/tech_hub_skills/skills/azure.md +149 -0
  111. package/tech_hub_skills/skills/code-review.md +399 -0
  112. package/tech_hub_skills/skills/compliance-automation.md +747 -0
  113. package/tech_hub_skills/skills/data-engineer.md +113 -0
  114. package/tech_hub_skills/skills/data-governance.md +102 -0
  115. package/tech_hub_skills/skills/data-scientist.md +123 -0
  116. package/tech_hub_skills/skills/devops.md +160 -0
  117. package/tech_hub_skills/skills/docker.md +160 -0
  118. package/tech_hub_skills/skills/enterprise-dashboard.md +613 -0
  119. package/tech_hub_skills/skills/finops.md +184 -0
  120. package/tech_hub_skills/skills/ml-engineer.md +115 -0
  121. package/tech_hub_skills/skills/mlops.md +187 -0
  122. package/tech_hub_skills/skills/optimization-advisor.md +329 -0
  123. package/tech_hub_skills/skills/orchestrator.md +497 -0
  124. package/tech_hub_skills/skills/platform-engineer.md +102 -0
  125. package/tech_hub_skills/skills/process-automation.md +226 -0
  126. package/tech_hub_skills/skills/process-changelog.md +184 -0
  127. package/tech_hub_skills/skills/process-documentation.md +484 -0
  128. package/tech_hub_skills/skills/process-kanban.md +324 -0
  129. package/tech_hub_skills/skills/process-versioning.md +214 -0
  130. package/tech_hub_skills/skills/product-designer.md +104 -0
  131. package/tech_hub_skills/skills/project-starter.md +443 -0
  132. package/tech_hub_skills/skills/security-architect.md +135 -0
  133. package/tech_hub_skills/skills/system-design.md +126 -0
@@ -0,0 +1,777 @@
+ # Skill 6: LLM Evaluation & Benchmarking
+
+ ## 🎯 Overview
+ Build comprehensive evaluation frameworks for LLM applications including automated testing, benchmark suites, A/B testing, and continuous quality monitoring for production systems.
+
+ ## 🔗 Connections
+ - **Data Engineer**: Test dataset curation, evaluation metrics storage (de-01, de-03)
+ - **Security Architect**: Adversarial testing, safety evaluation (sa-08)
+ - **ML Engineer**: Model comparison, performance benchmarking (ml-03, ml-05)
+ - **MLOps**: Continuous evaluation, metric tracking (mo-04, mo-05)
+ - **FinOps**: Cost-quality tradeoff analysis, evaluation budget optimization (fo-01, fo-07)
+ - **DevOps**: Automated testing in CI/CD, regression detection (do-01, do-06)
+ - **Data Scientist**: Statistical analysis, experiment design (ds-01, ds-08)
+
+ ## 🛠️ Tools Included
+
+ ### 1. `llm_evaluator.py`
+ Comprehensive evaluation framework with multiple metrics (accuracy, coherence, relevance, safety).
+
+ ### 2. `benchmark_runner.py`
+ Execute standard benchmarks (MMLU, HellaSwag, TruthfulQA) and custom task suites.
+
+ ### 3. `ab_test_framework.py`
+ A/B testing infrastructure for comparing models, prompts, and system configurations.
+
+ ### 4. `regression_detector.py`
+ Automated regression testing to catch quality degradation before production deployment.
+
+ ### 5. `eval_dataset_builder.py`
+ Create and manage evaluation datasets with versioning and golden reference answers.
+
+ ## 📊 Key Metrics
+ - Task accuracy and F1 score
+ - Response coherence and fluency
+ - Factual accuracy and hallucination rate
+ - Safety and bias scores
+ - Cost per evaluation
+
+ ## 🚀 Quick Start
+
+ ```python
+ from llm_evaluator import LLMEvaluator, EvaluationMetrics
+ from benchmark_runner import BenchmarkRunner
+
+ # Initialize evaluator
+ evaluator = LLMEvaluator(
+     metrics=[
+         "accuracy",
+         "coherence",
+         "factuality",
+         "safety",
+         "relevance"
+     ]
+ )
+
+ # Load test dataset
+ test_data = evaluator.load_dataset(
+     name="customer_support_qa",
+     version="v1.2"
+ )
+
+ # Evaluate model
+ results = evaluator.evaluate(
+     model="claude-3-5-sonnet-20241022",
+     test_data=test_data,
+     num_samples=500
+ )
+
+ # Print results
+ print(f"Accuracy: {results.accuracy:.3f}")
+ print(f"Coherence: {results.coherence:.3f}")
+ print(f"Factuality: {results.factuality:.3f}")
+ print(f"Safety: {results.safety:.3f}")
+ print(f"Cost: ${results.total_cost:.4f}")
+
+ # Run standard benchmarks
+ benchmark = BenchmarkRunner()
+ benchmark_results = benchmark.run(
+     model="claude-3-5-sonnet-20241022",
+     benchmarks=["mmlu", "hellaswag", "truthfulqa"]
+ )
+
+ print(f"\nBenchmark Results:")
+ for name, score in benchmark_results.items():
+     print(f" {name}: {score:.2%}")
+ ```
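+
+ `ab_test_framework.py` (tool 3 under Tools Included) compares two candidate models, prompts, or configurations on the same test set. A minimal sketch of the underlying comparison, assuming each variant has already been graded into per-sample pass/fail scores (the grading step itself is not shown):
+
+ ```python
+ from scipy import stats
+
+ def compare_variants(scores_a: list[int], scores_b: list[int], alpha: float = 0.05):
+     """Two-proportion z-test on per-sample scores (1 = correct, 0 = incorrect)."""
+     n_a, n_b = len(scores_a), len(scores_b)
+     p_a, p_b = sum(scores_a) / n_a, sum(scores_b) / n_b
+     # Pooled proportion under the null hypothesis that both variants perform equally
+     p_pool = (sum(scores_a) + sum(scores_b)) / (n_a + n_b)
+     se = (p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b)) ** 0.5
+     z = (p_b - p_a) / se
+     p_value = 2 * (1 - stats.norm.cdf(abs(z)))  # two-sided
+     return {"accuracy_a": p_a, "accuracy_b": p_b, "p_value": p_value,
+             "significant": p_value < alpha}
+
+ # Example: new prompt (B) vs current prompt (A), each graded on 200 samples
+ # result = compare_variants(graded_a, graded_b)
+ ```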
+
+ ## 📚 Best Practices
+
+ ### Cost Optimization (FinOps Integration)
+
+ 1. **Optimize Evaluation Frequency**
+    - Run full evals only on significant changes
+    - Use smoke tests for minor updates
+    - Implement tiered evaluation (quick → comprehensive)
+    - Schedule heavy evaluations during off-peak hours
+    - Reference: FinOps fo-07 (AI/ML Cost Optimization)
+
+ 2. **Sample-Based Evaluation**
+    - Use statistical sampling for large datasets
+    - Calculate confidence intervals
+    - Start with small samples, expand if needed
+    - Monitor sample size vs cost tradeoffs
+    - Reference: FinOps fo-03 (Budget Management)
+
+ 3. **Cache Evaluation Results**
+    - Cache model outputs for test sets
+    - Reuse evaluations across metrics
+    - Implement incremental evaluation
+    - Track cache hit rates
+    - Reference: FinOps fo-01 (Cost Monitoring)
+
+ 4. **Cost-Aware Benchmark Selection**
+    - Prioritize high-signal benchmarks
+    - Use lightweight metrics first
+    - Reserve expensive evals (human review) for critical cases
+    - Track evaluation cost per model
+    - Reference: FinOps fo-07 (AI/ML Cost Optimization)
+
+ ### Security & Privacy (Security Architect Integration)
+
+ 5. **Adversarial Testing**
+    - Test against prompt injection attacks
+    - Evaluate jailbreaking resistance
+    - Check for unsafe output generation
+    - Monitor for security regression
+    - Reference: Security Architect sa-08 (LLM Security)
+
+ 6. **Privacy-Preserving Evaluation**
+    - Anonymize evaluation datasets
+    - Remove PII from test cases (see the sketch after this subsection)
+    - Secure storage of evaluation results
+    - Audit access to evaluation data
+    - Reference: Security Architect sa-01 (PII Detection), sa-06 (Data Governance)
+
+ 7. **Safety Benchmark Suite**
+    - Evaluate toxic content generation
+    - Test bias across demographics
+    - Check compliance with safety policies
+    - Red team testing for edge cases
+    - Reference: Security Architect sa-08 (LLM Security)
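+
+ A minimal sketch of the anonymization step from item 6: regex-based scrubbing of obvious PII before a test case is written to the evaluation dataset. The patterns are illustrative only; a production setup would normally delegate detection to the PII tooling referenced in sa-01.
+
+ ```python
+ import re
+
+ # Illustrative patterns only; real deployments should use a PII detection service
+ PII_PATTERNS = {
+     "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
+     "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
+     "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
+ }
+
+ def scrub_pii(text: str) -> str:
+     """Replace matched PII spans with typed placeholders."""
+     for label, pattern in PII_PATTERNS.items():
+         text = pattern.sub(f"[{label.upper()}]", text)
+     return text
+
+ # Applied to every test case before it is stored:
+ # case["input"] = scrub_pii(case["input"])
+ ```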
+
+ ### Data Quality & Governance (Data Engineer Integration)
+
+ 8. **High-Quality Test Datasets**
+    - Curate diverse, representative test sets
+    - Include edge cases and failure modes
+    - Version test datasets with lineage (see the sketch after this subsection)
+    - Validate test data quality regularly
+    - Reference: Data Engineer de-03 (Data Quality)
+
+ 9. **Evaluation Data Pipeline**
+    - Automate test dataset updates
+    - Track dataset version and provenance
+    - Implement data validation checks
+    - Monitor dataset drift over time
+    - Reference: Data Engineer de-01 (Data Ingestion), de-02 (ETL)
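+
+ `eval_dataset_builder.py` (tool 5) handles the versioning called out in item 8; a minimal sketch of the idea, deriving the version from a hash of the dataset content so every result can be traced to the exact test set it was computed on. The file layout and field names here are assumptions.
+
+ ```python
+ import hashlib
+ import json
+ from pathlib import Path
+
+ def save_dataset_version(cases: list[dict], name: str, root: str = "eval_datasets") -> str:
+     """Write the dataset plus a manifest whose version is a hash of the content."""
+     payload = json.dumps(cases, sort_keys=True, ensure_ascii=False)
+     version = hashlib.sha256(payload.encode()).hexdigest()[:12]
+     out_dir = Path(root) / name / version
+     out_dir.mkdir(parents=True, exist_ok=True)
+     (out_dir / "cases.json").write_text(payload)
+     (out_dir / "manifest.json").write_text(
+         json.dumps({"name": name, "version": version, "num_cases": len(cases)})
+     )
+     return version
+
+ # cases = [{"input": "...", "golden_answer": "...", "tags": ["edge_case"]}]
+ # version = save_dataset_version(cases, "customer_support_qa")
+ ```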
+
+ ### Model Lifecycle Management (MLOps Integration)
+
+ 10. **Continuous Evaluation**
+     - Run evals on every model deployment
+     - Track metrics across model versions
+     - Set quality gates for production
+     - Alert on metric degradation
+     - Reference: MLOps mo-04 (Monitoring)
+
+ 11. **Evaluation Metrics Versioning**
+     - Version evaluation code and metrics
+     - Track metric definition changes
+     - Ensure reproducibility of results
+     - Maintain historical comparisons
+     - Reference: MLOps mo-01 (Model Registry), mo-03 (Versioning)
+
+ 12. **Performance Regression Detection**
+     - Compare new models against baselines
+     - Statistical significance testing (see the sketch after this subsection)
+     - Automated rollback on regression
+     - Track performance trends over time
+     - Reference: MLOps mo-05 (Drift Detection)
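+
+ A minimal sketch of the significance check behind `regression_detector.py` (tool 4), assuming the baseline and the candidate were graded on the same versioned test set: flag a regression only when the accuracy drop exceeds the agreed budget and the paired difference is statistically significant. The thresholds are illustrative.
+
+ ```python
+ from scipy import stats
+
+ def detect_regression(baseline: list[int], candidate: list[int],
+                       max_drop: float = 0.02, alpha: float = 0.05) -> bool:
+     """Per-sample scores on the same test set: 1 = answered correctly, 0 = not."""
+     p_base = sum(baseline) / len(baseline)
+     p_cand = sum(candidate) / len(candidate)
+     if p_base - p_cand <= max_drop:
+         return False  # within the agreed quality budget
+     # McNemar-style exact test on the discordant samples only
+     lost = sum(1 for b, c in zip(baseline, candidate) if b == 1 and c == 0)
+     gained = sum(1 for b, c in zip(baseline, candidate) if b == 0 and c == 1)
+     if lost + gained == 0:
+         return False
+     p_value = stats.binomtest(lost, lost + gained, 0.5, alternative="greater").pvalue
+     return p_value < alpha
+
+ # if detect_regression(baseline_scores, candidate_scores):
+ #     raise SystemExit("Quality regression detected - blocking deployment")
+ ```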
+
+ ### Deployment & Operations (DevOps Integration)
+
+ 13. **Automated Testing in CI/CD**
+     - Run evaluations in deployment pipeline
+     - Fail deployments on quality regression
+     - Parallel evaluation execution
+     - Generate evaluation reports automatically
+     - Reference: DevOps do-01 (CI/CD), do-06 (Testing)
+
+ 14. **Evaluation Infrastructure**
+     - Containerize evaluation workloads
+     - Distributed evaluation for speed
+     - Auto-scaling for benchmark runs
+     - Cost-optimized compute for evals
+     - Reference: DevOps do-03 (Containerization)
+
+ 15. **Observability for Evaluations**
+     - Track evaluation job status
+     - Monitor evaluation latency and costs
+     - Alert on evaluation failures
+     - Dashboard for evaluation metrics
+     - Reference: DevOps do-08 (Monitoring & Observability)
+
+ ### Azure-Specific Best Practices
+
+ 16. **Azure ML for Evaluation**
+     - Use Azure ML pipelines for evaluations
+     - Track experiments in Azure ML workspace
+     - Store evaluation datasets in Azure Storage
+     - Visualize results in Azure ML studio
+     - Reference: Azure az-04 (AI/ML Services)
+
+ 17. **Cost-Effective Evaluation Compute**
+     - Use spot instances for batch evaluations
+     - Right-size VM instances for workload
+     - Implement auto-shutdown after evals
+     - Monitor compute costs in Azure Cost Management
+     - Reference: Azure az-09 (Cost Management)
+
+ ## 💰 Cost Optimization Examples
+
+ ### Tiered Evaluation Strategy
+ ```python
+ from llm_evaluator import LLMEvaluator, EvaluationTier
+
+ class CostOptimizedEvaluator:
+     def __init__(self):
+         # Quick smoke test (low cost)
+         self.smoke_test = LLMEvaluator(
+             metrics=["basic_accuracy"],
+             sample_size=50,
+             cost_per_run=0.10
+         )
+
+         # Standard evaluation (medium cost)
+         self.standard_eval = LLMEvaluator(
+             metrics=["accuracy", "coherence", "relevance"],
+             sample_size=200,
+             cost_per_run=1.50
+         )
+
+         # Comprehensive evaluation (high cost)
+         self.comprehensive_eval = LLMEvaluator(
+             metrics=[
+                 "accuracy", "coherence", "relevance",
+                 "factuality", "safety", "bias"
+             ],
+             sample_size=1000,
+             cost_per_run=15.00
+         )
+
+     def evaluate_with_budget(self, model: str, change_type: str):
+         """Select evaluation tier based on change type."""
+
+         if change_type == "minor_update":
+             # Quick smoke test for minor changes
+             return self.smoke_test.evaluate(model)
+
+         elif change_type == "prompt_change":
+             # Standard eval for prompt/config changes
+             return self.standard_eval.evaluate(model)
+
+         elif change_type == "model_upgrade":
+             # Comprehensive eval for major changes
+             return self.comprehensive_eval.evaluate(model)
+
+ # Usage
+ evaluator = CostOptimizedEvaluator()
+
+ # Minor update: $0.10
+ smoke_results = evaluator.evaluate_with_budget(
+     model="claude-3-5-sonnet-20241022",
+     change_type="minor_update"
+ )
+
+ # Major update: $15.00 (but only when needed)
+ full_results = evaluator.evaluate_with_budget(
+     model="claude-opus-4-5-20251101",
+     change_type="model_upgrade"
+ )
+ ```
+
+ ### Statistical Sampling for Cost Reduction
+ ```python
+ import numpy as np
+ from scipy import stats
+
+ class SamplingEvaluator:
+     def __init__(self, full_dataset_size: int = 10000):
+         self.full_dataset = self.load_full_dataset()
+         self.full_dataset_size = full_dataset_size
+
+     def calculate_sample_size(
+         self,
+         confidence_level: float = 0.95,
+         margin_of_error: float = 0.02
+     ) -> int:
+         """Calculate minimum sample size for statistical validity."""
+         # For proportion estimation
+         z_score = stats.norm.ppf((1 + confidence_level) / 2)
+         p = 0.5  # Conservative estimate (maximum variance)
+
+         n = (z_score ** 2 * p * (1 - p)) / (margin_of_error ** 2)
+
+         # Finite population correction
+         n_adjusted = n / (1 + ((n - 1) / self.full_dataset_size))
+
+         return int(np.ceil(n_adjusted))
+
+     def evaluate_with_sampling(self, model: str, confidence: float = 0.95):
+         """Evaluate with statistically valid sampling."""
+         # Calculate required sample size
+         sample_size = self.calculate_sample_size(
+             confidence_level=confidence,
+             margin_of_error=0.02  # ±2% margin of error
+         )
+
+         print(f"Sample size: {sample_size} (vs {self.full_dataset_size} full)")
+
+         # Stratified sampling for better representation
+         sample = self.stratified_sample(self.full_dataset, sample_size)
+
+         # Evaluate on sample
+         results = self.evaluate(model, sample)
+
+         # Calculate confidence intervals
+         ci_lower, ci_upper = self.calculate_confidence_interval(
+             results.accuracy,
+             sample_size,
+             confidence
+         )
+
+         return {
+             "accuracy": results.accuracy,
+             "confidence_interval": (ci_lower, ci_upper),
+             "sample_size": sample_size,
+             "cost_saved": self.calculate_cost_savings(sample_size)
+         }
+
+     def calculate_cost_savings(self, sample_size: int) -> float:
+         """Calculate cost savings from sampling."""
+         full_cost = self.full_dataset_size * 0.002  # $0.002 per evaluation
+         sample_cost = sample_size * 0.002
+
+         savings = full_cost - sample_cost
+         savings_pct = (savings / full_cost) * 100
+
+         print(f"Cost savings: ${savings:.2f} ({savings_pct:.1f}%)")
+
+         return savings
+
+ # Usage
+ evaluator = SamplingEvaluator(full_dataset_size=10000)
+
+ # Evaluate with 95% confidence, ±2% margin of error
+ # Sample size: ~1,940 after finite population correction (vs 10,000 full)
+ # Cost: ~$3.87 (vs $20.00 full) → ~81% savings
+ results = evaluator.evaluate_with_sampling(
+     model="claude-3-5-sonnet-20241022",
+     confidence=0.95
+ )
+
+ print(f"Accuracy: {results['accuracy']:.3f} "
+       f"±{(results['confidence_interval'][1] - results['accuracy']):.3f}")
+ ```
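+
+ The class above assumes `load_full_dataset`, `stratified_sample`, `evaluate`, and `calculate_confidence_interval` are provided elsewhere in the skill. A minimal sketch of the last one, using the same normal approximation for a proportion as the sample-size formula above:
+
+ ```python
+ import math
+ from scipy import stats
+
+ def calculate_confidence_interval(accuracy: float, sample_size: int,
+                                   confidence_level: float = 0.95) -> tuple[float, float]:
+     """Wald interval for an accuracy estimated from sample_size graded samples."""
+     z = stats.norm.ppf((1 + confidence_level) / 2)
+     margin = z * math.sqrt(accuracy * (1 - accuracy) / sample_size)
+     return max(0.0, accuracy - margin), min(1.0, accuracy + margin)
+
+ # calculate_confidence_interval(0.91, 1937) -> approximately (0.897, 0.923)
+ ```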
+
+ ### Cached Evaluation Results
+ ```python
+ import hashlib
+ from typing import List
+
+ from eval_cache import EvaluationCache
+ from llm_evaluator import LLMEvaluator
+
+ class CachedEvaluator:
+     def __init__(self):
+         self.evaluator = LLMEvaluator()
+         self.cache = EvaluationCache(ttl_days=30)
+
+     def evaluate(self, model: str, test_data: dict, metrics: List[str]):
+         """Evaluate with result caching."""
+         # Generate cache key from model, data, and metrics
+         cache_key = self._generate_cache_key(model, test_data, metrics)
+
+         # Check cache
+         cached_result = self.cache.get(cache_key)
+         if cached_result:
+             print("✅ Cache hit - evaluation cost saved!")
+             return cached_result
+
+         # Run evaluation
+         print("🔄 Cache miss - running evaluation...")
+         result = self.evaluator.evaluate(
+             model=model,
+             test_data=test_data,
+             metrics=metrics
+         )
+
+         # Cache the result
+         self.cache.set(cache_key, result)
+
+         return result
+
+     def _generate_cache_key(
+         self,
+         model: str,
+         test_data: dict,
+         metrics: List[str]
+     ) -> str:
+         """Generate unique cache key."""
+         # Hash test data to detect changes
+         data_hash = hashlib.md5(
+             str(test_data).encode()
+         ).hexdigest()
+
+         # Combine all components
+         key_components = f"{model}:{data_hash}:{','.join(sorted(metrics))}"
+
+         return hashlib.md5(key_components.encode()).hexdigest()
+
+ # Usage
+ evaluator = CachedEvaluator()
+
+ # First run: Cache miss, costs $5.00
+ results1 = evaluator.evaluate(
+     model="claude-3-5-sonnet-20241022",
+     test_data=test_dataset,
+     metrics=["accuracy", "coherence"]
+ )
+
+ # Second run: Cache hit, costs $0.00
+ results2 = evaluator.evaluate(
+     model="claude-3-5-sonnet-20241022",
+     test_data=test_dataset,
+     metrics=["accuracy", "coherence"]
+ )
+
+ # Adding a metric changes the cache key, so this call re-runs the full evaluation;
+ # with per-metric caching the accuracy/coherence results above would be reused and
+ # only the new safety metric would run (about $1.50 instead of $5.00 in this example).
+ results3 = evaluator.evaluate(
+     model="claude-3-5-sonnet-20241022",
+     test_data=test_dataset,
+     metrics=["accuracy", "coherence", "safety"]  # +safety (new)
+ )
+ ```
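+
+ `eval_cache.EvaluationCache` appears to be part of this skill rather than a published library; a minimal file-backed sketch with the same `get`/`set`/`ttl_days` surface, using pickle purely for illustration:
+
+ ```python
+ import pickle
+ import time
+ from pathlib import Path
+
+ class EvaluationCache:
+     """Tiny file-backed cache keyed by the hashes produced in _generate_cache_key."""
+
+     def __init__(self, ttl_days: int = 30, cache_dir: str = ".eval_cache"):
+         self.ttl_seconds = ttl_days * 86400
+         self.cache_dir = Path(cache_dir)
+         self.cache_dir.mkdir(exist_ok=True)
+
+     def get(self, key: str):
+         path = self.cache_dir / f"{key}.pkl"
+         if not path.exists() or time.time() - path.stat().st_mtime > self.ttl_seconds:
+             return None  # missing or expired entry
+         return pickle.loads(path.read_bytes())
+
+     def set(self, key: str, value) -> None:
+         (self.cache_dir / f"{key}.pkl").write_bytes(pickle.dumps(value))
+ ```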
+
+ ## 🔒 Security Best Practices Examples
+
+ ### Adversarial Testing Suite
+ ```python
+ from adversarial_tester import AdversarialTester
+
+ class SecurityError(Exception):
+     """Raised when a security evaluation falls below the required threshold."""
+
+ class SecurityEvaluator:
+     def __init__(self):
+         self.adversarial_tester = AdversarialTester()
+
+     def evaluate_security(self, model: str):
+         """Comprehensive security evaluation."""
+         results = {}
+
+         # 1. Prompt injection testing
+         injection_tests = self.adversarial_tester.test_prompt_injection(
+             model=model,
+             attack_patterns=[
+                 "ignore_previous",
+                 "role_switch",
+                 "payload_splitting",
+                 "virtualization"
+             ]
+         )
+         results["prompt_injection_resistance"] = injection_tests.block_rate
+
+         # 2. Jailbreaking attempts
+         jailbreak_tests = self.adversarial_tester.test_jailbreaking(
+             model=model,
+             techniques=[
+                 "do_anything_now",
+                 "character_roleplay",
+                 "hypothetical_scenario"
+             ]
+         )
+         results["jailbreak_resistance"] = jailbreak_tests.block_rate
+
+         # 3. PII leakage testing
+         pii_tests = self.adversarial_tester.test_pii_leakage(
+             model=model,
+             pii_types=["ssn", "credit_card", "medical_record"]
+         )
+         results["pii_protection"] = 1 - pii_tests.leakage_rate
+
+         # 4. Toxic content generation
+         toxicity_tests = self.adversarial_tester.test_toxic_generation(
+             model=model,
+             categories=["hate", "violence", "sexual", "harassment"]
+         )
+         results["safety_filter_effectiveness"] = toxicity_tests.block_rate
+
+         return results
+
+ # Usage
+ security_eval = SecurityEvaluator()
+
+ security_scores = security_eval.evaluate_security(
+     model="claude-3-5-sonnet-20241022"
+ )
+
+ print("\n🔒 Security Evaluation Results:")
+ for metric, score in security_scores.items():
+     status = "✅ PASS" if score > 0.95 else "⚠️ REVIEW"
+     print(f" {metric}: {score:.2%} {status}")
+
+ # Fail deployment if security scores are too low
+ if security_scores["prompt_injection_resistance"] < 0.90:
+     raise SecurityError("Insufficient prompt injection protection")
+ ```
+
+ ### Safety Benchmark Suite
+ ```python
+ from safety_evaluator import SafetyEvaluator
+
+ class ComprehensiveSafetyEval:
+     def __init__(self):
+         self.safety_eval = SafetyEvaluator()
+
+     def run_safety_benchmarks(self, model: str):
+         """Run comprehensive safety evaluation."""
+         results = {}
+
+         # 1. Bias evaluation across demographics
+         bias_results = self.safety_eval.evaluate_bias(
+             model=model,
+             dimensions=[
+                 "gender",
+                 "race",
+                 "religion",
+                 "age",
+                 "nationality"
+             ],
+             test_scenarios=1000
+         )
+         results["bias_scores"] = bias_results
+
+         # 2. Toxicity evaluation
+         toxicity_results = self.safety_eval.evaluate_toxicity(
+             model=model,
+             categories=[
+                 "severe_toxicity",
+                 "obscene",
+                 "threat",
+                 "insult"
+             ]
+         )
+         results["toxicity_scores"] = toxicity_results
+
+         # 3. Truthfulness evaluation
+         truthfulness_results = self.safety_eval.evaluate_truthfulness(
+             model=model,
+             benchmark="truthfulqa",
+             categories=["health", "finance", "law"]
+         )
+         results["truthfulness_score"] = truthfulness_results.accuracy
+
+         # 4. Compliance testing
+         compliance_results = self.safety_eval.evaluate_compliance(
+             model=model,
+             standards=["hipaa", "gdpr", "financial_advice"]
+         )
+         results["compliance_scores"] = compliance_results
+
+         return results
+
+     def generate_safety_report(self, results: dict):
+         """Generate comprehensive safety report."""
+         report = "# Safety Evaluation Report\n\n"
+
+         # Bias scores
+         report += "## Bias Evaluation\n"
+         for dimension, score in results["bias_scores"].items():
+             report += f"- {dimension}: {score:.3f}\n"
+
+         # Toxicity scores
+         report += "\n## Toxicity Evaluation\n"
+         for category, score in results["toxicity_scores"].items():
+             report += f"- {category}: {score:.3f}\n"
+
+         # Truthfulness
+         report += "\n## Truthfulness\n"
+         report += f"- Score: {results['truthfulness_score']:.3f}\n"
+
+         # Compliance
+         report += "\n## Compliance\n"
+         for standard, passed in results["compliance_scores"].items():
+             status = "✅ PASS" if passed else "❌ FAIL"
+             report += f"- {standard}: {status}\n"
+
+         return report
+
+ # Usage
+ safety_eval = ComprehensiveSafetyEval()
+
+ results = safety_eval.run_safety_benchmarks(
+     model="claude-3-5-sonnet-20241022"
+ )
+
+ report = safety_eval.generate_safety_report(results)
+ print(report)
+
+ # Save report for compliance
+ with open("safety_evaluation_report.md", "w") as f:
+     f.write(report)
+ ```
+
+ ## 📊 Enhanced Metrics & Monitoring
+
+ | Metric Category | Metric | Target | Tool |
+ |-----------------|--------|--------|------|
+ | **Task Performance** | Accuracy | >0.90 | Custom evaluator |
+ | | F1 Score | >0.85 | scikit-learn |
+ | | BLEU score (generation) | >0.40 | nltk |
+ | | ROUGE-L (summarization) | >0.45 | rouge-score |
+ | **Benchmark Scores** | MMLU (reasoning) | >80% | Benchmark runner |
+ | | HellaSwag (common sense) | >85% | Benchmark runner |
+ | | TruthfulQA (truthfulness) | >75% | Benchmark runner |
+ | **Quality Metrics** | Coherence score | >0.85 | Custom evaluator |
+ | | Relevance score | >0.90 | Custom evaluator |
+ | | Hallucination rate | <5% | Factuality checker |
+ | **Safety & Bias** | Toxicity score | <0.02 | Perspective API |
+ | | Bias score (demographics) | <0.10 | Fairness evaluator |
+ | | Safety filter pass rate | >0.98 | Safety evaluator |
+ | **Costs** | Evaluation cost per model | <$50 | Cost tracker |
+ | | Cost per test sample | <$0.05 | Cost analyzer |
+ | | Cache hit rate | >50% | Eval cache |
+ | **Performance** | Evaluation runtime (1K samples) | <30 min | Time tracker |
+ | | Throughput (samples/sec) | >5 | Benchmark runner |
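+
+ The BLEU and ROUGE-L rows above come from the standard `nltk` and `rouge-score` packages; a minimal sketch of computing both for a single prediction/reference pair:
+
+ ```python
+ from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
+ from rouge_score import rouge_scorer
+
+ def generation_scores(prediction: str, reference: str) -> dict:
+     """Sentence-level BLEU and ROUGE-L F1 for one example."""
+     bleu = sentence_bleu(
+         [reference.split()], prediction.split(),
+         smoothing_function=SmoothingFunction().method1,  # avoids zero scores on short texts
+     )
+     rouge = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
+     rouge_l = rouge.score(reference, prediction)["rougeL"].fmeasure
+     return {"bleu": bleu, "rougeL": rouge_l}
+
+ # generation_scores("the cat sat on the mat", "a cat sat on the mat")
+ ```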
+
+ ## 🚀 Deployment Pipeline
+
+ ### CI/CD with Automated Evaluation
+ ```yaml
+ # .github/workflows/llm-evaluation-pipeline.yml
+ name: LLM Evaluation Pipeline
+
+ on:
+   pull_request:
+     paths:
+       - 'models/**'
+       - 'prompts/**'
+   push:
+     branches:
+       - main
+
+ jobs:
+   smoke-test:
+     runs-on: ubuntu-latest
+     steps:
+       - name: Quick smoke test (50 samples)
+         run: python scripts/run_smoke_test.py --samples 50
+
+       - name: Check basic accuracy
+         run: |
+           python scripts/check_accuracy.py --min-threshold 0.80
+
+   standard-evaluation:
+     needs: smoke-test
+     runs-on: ubuntu-latest
+     steps:
+       - name: Run standard evaluation (200 samples)
+         run: |
+           python scripts/run_evaluation.py \
+             --metrics accuracy,coherence,relevance \
+             --samples 200
+
+       - name: Check for regression
+         run: |
+           python scripts/check_regression.py \
+             --baseline-version v1.0 \
+             --max-degradation 0.05
+
+       - name: Generate evaluation report
+         run: python scripts/generate_eval_report.py
+
+       - name: Upload results to MLflow
+         run: python scripts/upload_to_mlflow.py
+
+   comprehensive-evaluation:
+     needs: standard-evaluation
+     if: github.event_name == 'push' && github.ref == 'refs/heads/main'
+     runs-on: ubuntu-latest
+     steps:
+       - name: Run comprehensive evaluation (1000 samples)
+         run: |
+           python scripts/run_comprehensive_eval.py \
+             --metrics all \
+             --samples 1000
+
+       - name: Run benchmark suite
+         run: python scripts/run_benchmarks.py --benchmarks mmlu,hellaswag,truthfulqa
+
+       - name: Security evaluation
+         run: python scripts/run_security_eval.py
+
+       - name: Generate final report
+         run: python scripts/generate_final_report.py
+
+       - name: Quality gate check
+         run: |
+           python scripts/quality_gate.py \
+             --min-accuracy 0.90 \
+             --max-toxicity 0.02 \
+             --min-safety 0.98
+
+   deploy-if-passing:
+     needs: comprehensive-evaluation
+     runs-on: ubuntu-latest
+     environment: production
+     steps:
+       - name: Deploy to production
+         if: success()
+         run: python scripts/deploy_model.py --environment production
+
+       - name: Monitor post-deployment
+         run: python scripts/monitor_production.py --duration 2h
+ ```
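+
+ The `scripts/quality_gate.py` step above is one of this skill's own helper scripts; a minimal sketch that reads the metrics produced by the earlier evaluation steps and exits non-zero so the job fails when any threshold is missed (the results file name and metric keys are assumptions):
+
+ ```python
+ #!/usr/bin/env python
+ """Fail the CI job when evaluation metrics miss the configured thresholds."""
+ import argparse
+ import json
+ import sys
+
+ def main() -> int:
+     parser = argparse.ArgumentParser()
+     parser.add_argument("--min-accuracy", type=float, default=0.90)
+     parser.add_argument("--max-toxicity", type=float, default=0.02)
+     parser.add_argument("--min-safety", type=float, default=0.98)
+     parser.add_argument("--results", default="eval_results.json")  # assumed file name
+     args = parser.parse_args()
+
+     with open(args.results) as f:
+         metrics = json.load(f)
+
+     failures = []
+     if metrics["accuracy"] < args.min_accuracy:
+         failures.append(f"accuracy {metrics['accuracy']:.3f} < {args.min_accuracy}")
+     if metrics["toxicity"] > args.max_toxicity:
+         failures.append(f"toxicity {metrics['toxicity']:.3f} > {args.max_toxicity}")
+     if metrics["safety"] < args.min_safety:
+         failures.append(f"safety {metrics['safety']:.3f} < {args.min_safety}")
+
+     for failure in failures:
+         print(f"QUALITY GATE FAILED: {failure}")
+     return 1 if failures else 0
+
+ if __name__ == "__main__":
+     sys.exit(main())
+ ```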
+
+ ## 🔄 Integration Workflow
+
+ ### End-to-End Evaluation Pipeline with All Roles
+ ```
+ 1. Code/Model Change Committed
+
+ 2. Trigger CI/CD Pipeline (do-01)
+
+ 3. Load Test Dataset (de-01)
+
+ 4. Validate Test Data Quality (de-03)
+
+ 5. Run Smoke Test (50 samples) (ai-06)
+
+ 6. Basic Regression Check (mo-05)
+
+ 7. Standard Evaluation (200 samples) (ai-06)
+
+ 8. Cost Tracking (fo-01)
+
+ 9. Security Evaluation (sa-08)
+
+ 10. Bias & Safety Testing (ds-01)
+
+ 11. Statistical Significance Test (ds-08)
+
+ 12. Quality Gate Check (mo-04)
+
+ 13. Comprehensive Evaluation (if passing) (ai-06)
+
+ 14. Benchmark Suite Execution (ai-06)
+
+ 15. Generate Evaluation Report (ai-06)
+
+ 16. Upload Metrics to MLflow (mo-01)
+
+ 17. Final Quality Gate (mo-04)
+
+ 18. Deploy if All Checks Pass (do-01)
+
+ 19. Post-Deployment Monitoring (mo-04)
+
+ 20. Continuous Evaluation in Production (mo-04)
+ ```
+
+ ## 🎯 Quick Wins
+
+ 1. **Implement smoke tests** - Catch major regressions quickly with 50-sample tests
+ 2. **Use statistical sampling** - 75% cost reduction with valid confidence intervals
+ 3. **Cache evaluation results** - Reuse evaluations across multiple metrics
+ 4. **Set up automated regression testing** - Block deployments on quality degradation
+ 5. **Run tiered evaluations** - Quick tests for minor changes, comprehensive for major
+ 6. **Add security evaluation** - Test against adversarial attacks before production
+ 7. **Track evaluation costs** - Monitor and optimize evaluation budget
+ 8. **Use distributed evaluation** - Parallelize for 10x faster benchmark execution