tech-hub-skills 1.0.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/LICENSE +21 -0
- package/README.md +250 -0
- package/bin/cli.js +241 -0
- package/bin/copilot.js +182 -0
- package/bin/postinstall.js +42 -0
- package/package.json +46 -0
- package/tech_hub_skills/roles/ai-engineer/skills/01-prompt-engineering/README.md +252 -0
- package/tech_hub_skills/roles/ai-engineer/skills/02-rag-pipeline/README.md +448 -0
- package/tech_hub_skills/roles/ai-engineer/skills/03-agent-orchestration/README.md +599 -0
- package/tech_hub_skills/roles/ai-engineer/skills/04-llm-guardrails/README.md +735 -0
- package/tech_hub_skills/roles/ai-engineer/skills/05-vector-embeddings/README.md +711 -0
- package/tech_hub_skills/roles/ai-engineer/skills/06-llm-evaluation/README.md +777 -0
- package/tech_hub_skills/roles/azure/skills/01-infrastructure-fundamentals/README.md +264 -0
- package/tech_hub_skills/roles/azure/skills/02-data-factory/README.md +264 -0
- package/tech_hub_skills/roles/azure/skills/03-synapse-analytics/README.md +264 -0
- package/tech_hub_skills/roles/azure/skills/04-databricks/README.md +264 -0
- package/tech_hub_skills/roles/azure/skills/05-functions/README.md +264 -0
- package/tech_hub_skills/roles/azure/skills/06-kubernetes-service/README.md +264 -0
- package/tech_hub_skills/roles/azure/skills/07-openai-service/README.md +264 -0
- package/tech_hub_skills/roles/azure/skills/08-machine-learning/README.md +264 -0
- package/tech_hub_skills/roles/azure/skills/09-storage-adls/README.md +264 -0
- package/tech_hub_skills/roles/azure/skills/10-networking/README.md +264 -0
- package/tech_hub_skills/roles/azure/skills/11-sql-cosmos/README.md +264 -0
- package/tech_hub_skills/roles/azure/skills/12-event-hubs/README.md +264 -0
- package/tech_hub_skills/roles/code-review/skills/01-automated-code-review/README.md +394 -0
- package/tech_hub_skills/roles/code-review/skills/02-pr-review-workflow/README.md +427 -0
- package/tech_hub_skills/roles/code-review/skills/03-code-quality-gates/README.md +518 -0
- package/tech_hub_skills/roles/code-review/skills/04-reviewer-assignment/README.md +504 -0
- package/tech_hub_skills/roles/code-review/skills/05-review-analytics/README.md +540 -0
- package/tech_hub_skills/roles/data-engineer/skills/01-lakehouse-architecture/README.md +550 -0
- package/tech_hub_skills/roles/data-engineer/skills/02-etl-pipeline/README.md +580 -0
- package/tech_hub_skills/roles/data-engineer/skills/03-data-quality/README.md +579 -0
- package/tech_hub_skills/roles/data-engineer/skills/04-streaming-pipelines/README.md +608 -0
- package/tech_hub_skills/roles/data-engineer/skills/05-performance-optimization/README.md +547 -0
- package/tech_hub_skills/roles/data-governance/skills/01-data-catalog/README.md +112 -0
- package/tech_hub_skills/roles/data-governance/skills/02-data-lineage/README.md +129 -0
- package/tech_hub_skills/roles/data-governance/skills/03-data-quality-framework/README.md +182 -0
- package/tech_hub_skills/roles/data-governance/skills/04-access-control/README.md +39 -0
- package/tech_hub_skills/roles/data-governance/skills/05-master-data-management/README.md +40 -0
- package/tech_hub_skills/roles/data-governance/skills/06-compliance-privacy/README.md +46 -0
- package/tech_hub_skills/roles/data-scientist/skills/01-eda-automation/README.md +230 -0
- package/tech_hub_skills/roles/data-scientist/skills/02-statistical-modeling/README.md +264 -0
- package/tech_hub_skills/roles/data-scientist/skills/03-feature-engineering/README.md +264 -0
- package/tech_hub_skills/roles/data-scientist/skills/04-predictive-modeling/README.md +264 -0
- package/tech_hub_skills/roles/data-scientist/skills/05-customer-analytics/README.md +264 -0
- package/tech_hub_skills/roles/data-scientist/skills/06-campaign-analysis/README.md +264 -0
- package/tech_hub_skills/roles/data-scientist/skills/07-experimentation/README.md +264 -0
- package/tech_hub_skills/roles/data-scientist/skills/08-data-visualization/README.md +264 -0
- package/tech_hub_skills/roles/devops/skills/01-cicd-pipeline/README.md +264 -0
- package/tech_hub_skills/roles/devops/skills/02-container-orchestration/README.md +264 -0
- package/tech_hub_skills/roles/devops/skills/03-infrastructure-as-code/README.md +264 -0
- package/tech_hub_skills/roles/devops/skills/04-gitops/README.md +264 -0
- package/tech_hub_skills/roles/devops/skills/05-environment-management/README.md +264 -0
- package/tech_hub_skills/roles/devops/skills/06-automated-testing/README.md +264 -0
- package/tech_hub_skills/roles/devops/skills/07-release-management/README.md +264 -0
- package/tech_hub_skills/roles/devops/skills/08-monitoring-alerting/README.md +264 -0
- package/tech_hub_skills/roles/devops/skills/09-devsecops/README.md +265 -0
- package/tech_hub_skills/roles/finops/skills/01-cost-visibility/README.md +264 -0
- package/tech_hub_skills/roles/finops/skills/02-resource-tagging/README.md +264 -0
- package/tech_hub_skills/roles/finops/skills/03-budget-management/README.md +264 -0
- package/tech_hub_skills/roles/finops/skills/04-reserved-instances/README.md +264 -0
- package/tech_hub_skills/roles/finops/skills/05-spot-optimization/README.md +264 -0
- package/tech_hub_skills/roles/finops/skills/06-storage-tiering/README.md +264 -0
- package/tech_hub_skills/roles/finops/skills/07-compute-rightsizing/README.md +264 -0
- package/tech_hub_skills/roles/finops/skills/08-chargeback/README.md +264 -0
- package/tech_hub_skills/roles/ml-engineer/skills/01-mlops-pipeline/README.md +566 -0
- package/tech_hub_skills/roles/ml-engineer/skills/02-feature-engineering/README.md +655 -0
- package/tech_hub_skills/roles/ml-engineer/skills/03-model-training/README.md +704 -0
- package/tech_hub_skills/roles/ml-engineer/skills/04-model-serving/README.md +845 -0
- package/tech_hub_skills/roles/ml-engineer/skills/05-model-monitoring/README.md +874 -0
- package/tech_hub_skills/roles/mlops/skills/01-ml-pipeline-orchestration/README.md +264 -0
- package/tech_hub_skills/roles/mlops/skills/02-experiment-tracking/README.md +264 -0
- package/tech_hub_skills/roles/mlops/skills/03-model-registry/README.md +264 -0
- package/tech_hub_skills/roles/mlops/skills/04-feature-store/README.md +264 -0
- package/tech_hub_skills/roles/mlops/skills/05-model-deployment/README.md +264 -0
- package/tech_hub_skills/roles/mlops/skills/06-model-observability/README.md +264 -0
- package/tech_hub_skills/roles/mlops/skills/07-data-versioning/README.md +264 -0
- package/tech_hub_skills/roles/mlops/skills/08-ab-testing/README.md +264 -0
- package/tech_hub_skills/roles/mlops/skills/09-automated-retraining/README.md +264 -0
- package/tech_hub_skills/roles/platform-engineer/skills/01-internal-developer-platform/README.md +153 -0
- package/tech_hub_skills/roles/platform-engineer/skills/02-self-service-infrastructure/README.md +57 -0
- package/tech_hub_skills/roles/platform-engineer/skills/03-slo-sli-management/README.md +59 -0
- package/tech_hub_skills/roles/platform-engineer/skills/04-developer-experience/README.md +57 -0
- package/tech_hub_skills/roles/platform-engineer/skills/05-incident-management/README.md +73 -0
- package/tech_hub_skills/roles/platform-engineer/skills/06-capacity-management/README.md +59 -0
- package/tech_hub_skills/roles/product-designer/skills/01-requirements-discovery/README.md +407 -0
- package/tech_hub_skills/roles/product-designer/skills/02-user-research/README.md +382 -0
- package/tech_hub_skills/roles/product-designer/skills/03-brainstorming-ideation/README.md +437 -0
- package/tech_hub_skills/roles/product-designer/skills/04-ux-design/README.md +496 -0
- package/tech_hub_skills/roles/product-designer/skills/05-product-market-fit/README.md +376 -0
- package/tech_hub_skills/roles/product-designer/skills/06-stakeholder-management/README.md +412 -0
- package/tech_hub_skills/roles/security-architect/skills/01-pii-detection/README.md +319 -0
- package/tech_hub_skills/roles/security-architect/skills/02-threat-modeling/README.md +264 -0
- package/tech_hub_skills/roles/security-architect/skills/03-infrastructure-security/README.md +264 -0
- package/tech_hub_skills/roles/security-architect/skills/04-iam/README.md +264 -0
- package/tech_hub_skills/roles/security-architect/skills/05-application-security/README.md +264 -0
- package/tech_hub_skills/roles/security-architect/skills/06-secrets-management/README.md +264 -0
- package/tech_hub_skills/roles/security-architect/skills/07-security-monitoring/README.md +264 -0
- package/tech_hub_skills/roles/system-design/skills/01-architecture-patterns/README.md +337 -0
- package/tech_hub_skills/roles/system-design/skills/02-requirements-engineering/README.md +264 -0
- package/tech_hub_skills/roles/system-design/skills/03-scalability/README.md +264 -0
- package/tech_hub_skills/roles/system-design/skills/04-high-availability/README.md +264 -0
- package/tech_hub_skills/roles/system-design/skills/05-cost-optimization-design/README.md +264 -0
- package/tech_hub_skills/roles/system-design/skills/06-api-design/README.md +264 -0
- package/tech_hub_skills/roles/system-design/skills/07-observability-architecture/README.md +264 -0
- package/tech_hub_skills/roles/system-design/skills/08-process-automation/PROCESS_TEMPLATE.md +336 -0
- package/tech_hub_skills/roles/system-design/skills/08-process-automation/README.md +521 -0
- package/tech_hub_skills/skills/README.md +336 -0
- package/tech_hub_skills/skills/ai-engineer.md +104 -0
- package/tech_hub_skills/skills/azure.md +149 -0
- package/tech_hub_skills/skills/code-review.md +399 -0
- package/tech_hub_skills/skills/compliance-automation.md +747 -0
- package/tech_hub_skills/skills/data-engineer.md +113 -0
- package/tech_hub_skills/skills/data-governance.md +102 -0
- package/tech_hub_skills/skills/data-scientist.md +123 -0
- package/tech_hub_skills/skills/devops.md +160 -0
- package/tech_hub_skills/skills/docker.md +160 -0
- package/tech_hub_skills/skills/enterprise-dashboard.md +613 -0
- package/tech_hub_skills/skills/finops.md +184 -0
- package/tech_hub_skills/skills/ml-engineer.md +115 -0
- package/tech_hub_skills/skills/mlops.md +187 -0
- package/tech_hub_skills/skills/optimization-advisor.md +329 -0
- package/tech_hub_skills/skills/orchestrator.md +497 -0
- package/tech_hub_skills/skills/platform-engineer.md +102 -0
- package/tech_hub_skills/skills/process-automation.md +226 -0
- package/tech_hub_skills/skills/process-changelog.md +184 -0
- package/tech_hub_skills/skills/process-documentation.md +484 -0
- package/tech_hub_skills/skills/process-kanban.md +324 -0
- package/tech_hub_skills/skills/process-versioning.md +214 -0
- package/tech_hub_skills/skills/product-designer.md +104 -0
- package/tech_hub_skills/skills/project-starter.md +443 -0
- package/tech_hub_skills/skills/security-architect.md +135 -0
- package/tech_hub_skills/skills/system-design.md +126 -0
@@ -0,0 +1,777 @@

# Skill 6: LLM Evaluation & Benchmarking

## 🎯 Overview
Build comprehensive evaluation frameworks for LLM applications including automated testing, benchmark suites, A/B testing, and continuous quality monitoring for production systems.

## 🔗 Connections
- **Data Engineer**: Test dataset curation, evaluation metrics storage (de-01, de-03)
- **Security Architect**: Adversarial testing, safety evaluation (sa-08)
- **ML Engineer**: Model comparison, performance benchmarking (ml-03, ml-05)
- **MLOps**: Continuous evaluation, metric tracking (mo-04, mo-05)
- **FinOps**: Cost-quality tradeoff analysis, evaluation budget optimization (fo-01, fo-07)
- **DevOps**: Automated testing in CI/CD, regression detection (do-01, do-06)
- **Data Scientist**: Statistical analysis, experiment design (ds-01, ds-08)

## 🛠️ Tools Included

### 1. `llm_evaluator.py`
Comprehensive evaluation framework with multiple metrics (accuracy, coherence, relevance, safety).

### 2. `benchmark_runner.py`
Execute standard benchmarks (MMLU, HellaSwag, TruthfulQA) and custom task suites.

### 3. `ab_test_framework.py`
A/B testing infrastructure for comparing models, prompts, and system configurations.

### 4. `regression_detector.py`
Automated regression testing to catch quality degradation before production deployment.

### 5. `eval_dataset_builder.py`
Create and manage evaluation datasets with versioning and golden reference answers.

## 📊 Key Metrics
- Task accuracy and F1 score
- Response coherence and fluency
- Factual accuracy and hallucination rate
- Safety and bias scores
- Cost per evaluation

## 🚀 Quick Start

```python
from llm_evaluator import LLMEvaluator, EvaluationMetrics
from benchmark_runner import BenchmarkRunner

# Initialize evaluator
evaluator = LLMEvaluator(
    metrics=[
        "accuracy",
        "coherence",
        "factuality",
        "safety",
        "relevance"
    ]
)

# Load test dataset
test_data = evaluator.load_dataset(
    name="customer_support_qa",
    version="v1.2"
)

# Evaluate model
results = evaluator.evaluate(
    model="claude-3-5-sonnet-20241022",
    test_data=test_data,
    num_samples=500
)

# Print results
print(f"Accuracy: {results.accuracy:.3f}")
print(f"Coherence: {results.coherence:.3f}")
print(f"Factuality: {results.factuality:.3f}")
print(f"Safety: {results.safety:.3f}")
print(f"Cost: ${results.total_cost:.4f}")

# Run standard benchmarks
benchmark = BenchmarkRunner()
benchmark_results = benchmark.run(
    model="claude-3-5-sonnet-20241022",
    benchmarks=["mmlu", "hellaswag", "truthfulqa"]
)

print(f"\nBenchmark Results:")
for name, score in benchmark_results.items():
    print(f"  {name}: {score:.2%}")
```

## 📚 Best Practices

### Cost Optimization (FinOps Integration)

1. **Optimize Evaluation Frequency**
   - Run full evals only on significant changes
   - Use smoke tests for minor updates
   - Implement tiered evaluation (quick → comprehensive)
   - Schedule heavy evaluations during off-peak hours
   - Reference: FinOps fo-07 (AI/ML Cost Optimization)

2. **Sample-Based Evaluation**
   - Use statistical sampling for large datasets
   - Calculate confidence intervals
   - Start with small samples, expand if needed
   - Monitor sample size vs cost tradeoffs
   - Reference: FinOps fo-03 (Budget Management)

3. **Cache Evaluation Results**
   - Cache model outputs for test sets
   - Reuse evaluations across metrics
   - Implement incremental evaluation
   - Track cache hit rates
   - Reference: FinOps fo-01 (Cost Monitoring)

4. **Cost-Aware Benchmark Selection**
   - Prioritize high-signal benchmarks
   - Use lightweight metrics first
   - Reserve expensive evals (human review) for critical cases
   - Track evaluation cost per model
   - Reference: FinOps fo-07 (AI/ML Cost Optimization)

### Security & Privacy (Security Architect Integration)

5. **Adversarial Testing**
   - Test against prompt injection attacks
   - Evaluate jailbreaking resistance
   - Check for unsafe output generation
   - Monitor for security regression
   - Reference: Security Architect sa-08 (LLM Security)

6. **Privacy-Preserving Evaluation**
   - Anonymize evaluation datasets
   - Remove PII from test cases (see the sketch after this list)
   - Secure storage of evaluation results
   - Audit access to evaluation data
   - Reference: Security Architect sa-01 (PII Detection), sa-06 (Data Governance)

7. **Safety Benchmark Suite**
   - Evaluate toxic content generation
   - Test bias across demographics
   - Check compliance with safety policies
   - Red team testing for edge cases
   - Reference: Security Architect sa-08 (LLM Security)
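
A minimal sketch of the PII-scrubbing step from item 6, assuming test cases are plain question/answer records; the regex patterns and function names are illustrative only, and a production pipeline would lean on the dedicated PII detection tooling referenced in sa-01.

```python
import re

# Illustrative patterns only; real PII detection (sa-01) covers far more cases.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def scrub_pii(text: str) -> str:
    """Replace recognized PII spans with typed placeholders."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label.upper()}_REDACTED]", text)
    return text

def scrub_test_cases(test_cases: list[dict]) -> list[dict]:
    """Scrub PII from prompts and golden answers before they enter the eval set."""
    return [
        {**case,
         "prompt": scrub_pii(case["prompt"]),
         "golden_answer": scrub_pii(case["golden_answer"])}
        for case in test_cases
    ]

# Usage
raw_cases = [{"prompt": "Email jane.doe@example.com about her claim",
              "golden_answer": "Reply sent to jane.doe@example.com"}]
print(scrub_test_cases(raw_cases))
```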

### Data Quality & Governance (Data Engineer Integration)

8. **High-Quality Test Datasets**
   - Curate diverse, representative test sets
   - Include edge cases and failure modes
   - Version test datasets with lineage
   - Validate test data quality regularly
   - Reference: Data Engineer de-03 (Data Quality)

9. **Evaluation Data Pipeline**
   - Automate test dataset updates
   - Track dataset version and provenance
   - Implement data validation checks
   - Monitor dataset drift over time
   - Reference: Data Engineer de-01 (Data Ingestion), de-02 (ETL)
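
A minimal sketch of how a versioned test set with golden answers and a basic validation gate might look, in the spirit of `eval_dataset_builder.py`; the file layout, field names, and dataset name are assumptions for illustration, not the package's actual API.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

REQUIRED_FIELDS = {"id", "prompt", "golden_answer"}

def validate_cases(cases: list[dict]) -> None:
    """Basic quality checks: required fields present, no duplicate IDs."""
    ids = [c.get("id") for c in cases]
    if len(ids) != len(set(ids)):
        raise ValueError("duplicate test case IDs")
    for case in cases:
        missing = REQUIRED_FIELDS - case.keys()
        if missing:
            raise ValueError(f"case {case.get('id')} missing fields: {missing}")

def save_dataset_version(cases: list[dict], out_dir: str = "eval_datasets") -> str:
    """Write a content-addressed snapshot so every eval run can cite an exact version."""
    validate_cases(cases)
    payload = json.dumps(cases, sort_keys=True, ensure_ascii=False)
    version = hashlib.sha256(payload.encode()).hexdigest()[:12]
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    with (out_path / f"customer_support_qa-{version}.jsonl").open("w", encoding="utf-8") as f:
        for case in cases:
            f.write(json.dumps(case, ensure_ascii=False) + "\n")
    manifest = {"version": version, "size": len(cases),
                "created_at": datetime.now(timezone.utc).isoformat()}
    (out_path / f"customer_support_qa-{version}.manifest.json").write_text(
        json.dumps(manifest, indent=2))
    return version

# Usage
cases = [{"id": "q1", "prompt": "How do I reset my password?",
          "golden_answer": "Use the 'Forgot password' link on the sign-in page."}]
print(save_dataset_version(cases))
```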

### Model Lifecycle Management (MLOps Integration)

10. **Continuous Evaluation**
    - Run evals on every model deployment
    - Track metrics across model versions
    - Set quality gates for production
    - Alert on metric degradation
    - Reference: MLOps mo-04 (Monitoring)

11. **Evaluation Metrics Versioning**
    - Version evaluation code and metrics
    - Track metric definition changes
    - Ensure reproducibility of results
    - Maintain historical comparisons
    - Reference: MLOps mo-01 (Model Registry), mo-03 (Versioning)

12. **Performance Regression Detection**
    - Compare new models against baselines
    - Statistical significance testing
    - Automated rollback on regression
    - Track performance trends over time
    - Reference: MLOps mo-05 (Drift Detection)
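
A minimal sketch of the significance check behind item 12, assuming candidate and baseline were scored on the same pass/fail accuracy metric; a one-sided two-proportion z-test flags degradation that is unlikely to be sampling noise. The threshold and sample counts are illustrative.

```python
import math
from scipy.stats import norm

def is_regression(baseline_correct: int, baseline_n: int,
                  candidate_correct: int, candidate_n: int,
                  alpha: float = 0.05) -> bool:
    """Return True if the candidate's accuracy is significantly below the baseline's."""
    p1 = baseline_correct / baseline_n
    p2 = candidate_correct / candidate_n
    pooled = (baseline_correct + candidate_correct) / (baseline_n + candidate_n)
    se = math.sqrt(pooled * (1 - pooled) * (1 / baseline_n + 1 / candidate_n))
    z = (p2 - p1) / se          # negative z means the candidate is worse
    p_value = norm.cdf(z)       # one-sided test for degradation
    print(f"baseline={p1:.3f} candidate={p2:.3f} z={z:.2f} p={p_value:.4f}")
    return p_value < alpha

# Usage: block the rollout if accuracy dropped significantly
if is_regression(baseline_correct=182, baseline_n=200,
                 candidate_correct=168, candidate_n=200):
    raise SystemExit("Regression detected - keep the baseline model in production")
```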

### Deployment & Operations (DevOps Integration)

13. **Automated Testing in CI/CD**
    - Run evaluations in deployment pipeline
    - Fail deployments on quality regression
    - Parallel evaluation execution
    - Generate evaluation reports automatically
    - Reference: DevOps do-01 (CI/CD), do-06 (Testing)

14. **Evaluation Infrastructure**
    - Containerize evaluation workloads
    - Distributed evaluation for speed
    - Auto-scaling for benchmark runs
    - Cost-optimized compute for evals
    - Reference: DevOps do-03 (Containerization)

15. **Observability for Evaluations**
    - Track evaluation job status
    - Monitor evaluation latency and costs
    - Alert on evaluation failures
    - Dashboard for evaluation metrics
    - Reference: DevOps do-08 (Monitoring & Observability)
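
A minimal sketch of the per-run telemetry behind item 15: emit one structured JSON record per evaluation job so it can be shipped to whatever log pipeline and dashboard you already operate. The field names and cost alert threshold are illustrative assumptions.

```python
import json
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("eval_observability")

COST_ALERT_THRESHOLD_USD = 50.0  # illustrative budget per run

def run_with_telemetry(job_name: str, run_eval, **kwargs) -> dict:
    """Wrap an evaluation callable and emit a structured record of the run."""
    started = time.time()
    record = {"job": job_name, "status": "running", "params": kwargs}
    try:
        result = run_eval(**kwargs)
        record.update(status="succeeded",
                      accuracy=result.get("accuracy"),
                      cost_usd=result.get("cost_usd"))
    except Exception as exc:  # alert on evaluation failures
        record.update(status="failed", error=str(exc))
        raise
    finally:
        record["duration_s"] = round(time.time() - started, 1)
        logger.info(json.dumps(record))
        cost = record.get("cost_usd") or 0
        if cost > COST_ALERT_THRESHOLD_USD:
            logger.warning(f"ALERT: {job_name} cost ${cost:.2f} exceeded "
                           f"${COST_ALERT_THRESHOLD_USD:.2f}")
    return record

# Usage with a stubbed evaluation function
record = run_with_telemetry("nightly-standard-eval",
                            lambda samples: {"accuracy": 0.91, "cost_usd": 1.50},
                            samples=200)
```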

### Azure-Specific Best Practices

16. **Azure ML for Evaluation**
    - Use Azure ML pipelines for evaluations (see the sketch after this list)
    - Track experiments in Azure ML workspace
    - Store evaluation datasets in Azure Storage
    - Visualize results in Azure ML studio
    - Reference: Azure az-04 (AI/ML Services)

17. **Cost-Effective Evaluation Compute**
    - Use spot instances for batch evaluations
    - Right-size VM instances for workload
    - Implement auto-shutdown after evals
    - Monitor compute costs in Azure Cost Management
    - Reference: Azure az-09 (Cost Management)
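
A minimal sketch of submitting the evaluation script as an Azure ML command job (item 16), using the Azure ML Python SDK v2; the subscription, workspace, compute cluster, and environment names are placeholders you would replace with your own.

```python
from azure.ai.ml import MLClient, command
from azure.identity import DefaultAzureCredential

# Placeholder workspace coordinates - replace with your own values
ml_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace-name>",
)

# Package the evaluation script as a command job on a shared compute cluster
eval_job = command(
    code="./scripts",  # folder containing run_evaluation.py
    command="python run_evaluation.py --metrics accuracy,coherence,relevance --samples 200",
    environment="<registered-environment>@latest",  # placeholder Azure ML environment
    compute="cpu-cluster",                          # assumed pre-provisioned compute target
    display_name="llm-standard-evaluation",
    experiment_name="llm-evaluation",
)

returned_job = ml_client.jobs.create_or_update(eval_job)
print(f"Submitted: {returned_job.name} -> {returned_job.studio_url}")
```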

## 💰 Cost Optimization Examples

### Tiered Evaluation Strategy
```python
from llm_evaluator import LLMEvaluator, EvaluationTier

class CostOptimizedEvaluator:
    def __init__(self):
        # Quick smoke test (low cost)
        self.smoke_test = LLMEvaluator(
            metrics=["basic_accuracy"],
            sample_size=50,
            cost_per_run=0.10
        )

        # Standard evaluation (medium cost)
        self.standard_eval = LLMEvaluator(
            metrics=["accuracy", "coherence", "relevance"],
            sample_size=200,
            cost_per_run=1.50
        )

        # Comprehensive evaluation (high cost)
        self.comprehensive_eval = LLMEvaluator(
            metrics=[
                "accuracy", "coherence", "relevance",
                "factuality", "safety", "bias"
            ],
            sample_size=1000,
            cost_per_run=15.00
        )

    def evaluate_with_budget(self, model: str, change_type: str):
        """Select evaluation tier based on change type."""

        if change_type == "minor_update":
            # Quick smoke test for minor changes
            return self.smoke_test.evaluate(model)

        elif change_type == "prompt_change":
            # Standard eval for prompt/config changes
            return self.standard_eval.evaluate(model)

        elif change_type == "model_upgrade":
            # Comprehensive eval for major changes
            return self.comprehensive_eval.evaluate(model)

# Usage
evaluator = CostOptimizedEvaluator()

# Minor update: $0.10
smoke_results = evaluator.evaluate_with_budget(
    model="claude-3-5-sonnet-20241022",
    change_type="minor_update"
)

# Major update: $15.00 (but only when needed)
full_results = evaluator.evaluate_with_budget(
    model="claude-opus-4-5-20251101",
    change_type="model_upgrade"
)
```

### Statistical Sampling for Cost Reduction
```python
import numpy as np
from scipy import stats

class SamplingEvaluator:
    def __init__(self, full_dataset_size: int = 10000):
        self.full_dataset = self.load_full_dataset()
        self.full_dataset_size = full_dataset_size

    def calculate_sample_size(
        self,
        confidence_level: float = 0.95,
        margin_of_error: float = 0.02
    ) -> int:
        """Calculate minimum sample size for statistical validity."""
        # For proportion estimation
        z_score = stats.norm.ppf((1 + confidence_level) / 2)
        p = 0.5  # Conservative estimate (maximum variance)

        n = (z_score ** 2 * p * (1 - p)) / (margin_of_error ** 2)

        # Finite population correction
        n_adjusted = n / (1 + ((n - 1) / self.full_dataset_size))

        return int(np.ceil(n_adjusted))

    def evaluate_with_sampling(self, model: str, confidence: float = 0.95):
        """Evaluate with statistically valid sampling."""
        # Calculate required sample size
        sample_size = self.calculate_sample_size(
            confidence_level=confidence,
            margin_of_error=0.02  # ±2% margin of error
        )

        print(f"Sample size: {sample_size} (vs {self.full_dataset_size} full)")

        # Stratified sampling for better representation
        sample = self.stratified_sample(self.full_dataset, sample_size)

        # Evaluate on sample
        results = self.evaluate(model, sample)

        # Calculate confidence intervals
        ci_lower, ci_upper = self.calculate_confidence_interval(
            results.accuracy,
            sample_size,
            confidence
        )

        return {
            "accuracy": results.accuracy,
            "confidence_interval": (ci_lower, ci_upper),
            "sample_size": sample_size,
            "cost_saved": self.calculate_cost_savings(sample_size)
        }

    def calculate_cost_savings(self, sample_size: int) -> float:
        """Calculate cost savings from sampling."""
        full_cost = self.full_dataset_size * 0.002  # $0.002 per evaluation
        sample_cost = sample_size * 0.002

        savings = full_cost - sample_cost
        savings_pct = (savings / full_cost) * 100

        print(f"Cost savings: ${savings:.2f} ({savings_pct:.1f}%)")

        return savings

# Usage
evaluator = SamplingEvaluator(full_dataset_size=10000)

# Evaluate with 95% confidence, ±2% margin of error
# Sample size: ~1,940 after finite population correction (vs 10,000 full)
# Cost: ~$3.87 (vs $20.00 full) → ~81% savings
results = evaluator.evaluate_with_sampling(
    model="claude-3-5-sonnet-20241022",
    confidence=0.95
)

print(f"Accuracy: {results['accuracy']:.3f} "
      f"±{(results['confidence_interval'][1] - results['accuracy']):.3f}")
```
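
The class above leaves helpers such as `load_full_dataset`, `stratified_sample`, `evaluate`, and `calculate_confidence_interval` undefined. A minimal sketch of the last one, assuming accuracy is a proportion and using the same normal approximation as the sample-size formula:

```python
import numpy as np
from scipy import stats

def calculate_confidence_interval(accuracy: float, sample_size: int,
                                  confidence: float = 0.95) -> tuple[float, float]:
    """Normal-approximation confidence interval for an accuracy proportion."""
    z = stats.norm.ppf((1 + confidence) / 2)
    se = np.sqrt(accuracy * (1 - accuracy) / sample_size)
    margin = z * se
    return max(0.0, accuracy - margin), min(1.0, accuracy + margin)

# Usage: 0.91 accuracy on 1,937 samples → roughly ±0.013 at 95% confidence
print(calculate_confidence_interval(0.91, 1937))
```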

### Cached Evaluation Results
```python
from typing import List
import hashlib

from eval_cache import EvaluationCache
from llm_evaluator import LLMEvaluator

class CachedEvaluator:
    def __init__(self):
        self.evaluator = LLMEvaluator()
        self.cache = EvaluationCache(ttl_days=30)

    def evaluate(self, model: str, test_data: dict, metrics: List[str]):
        """Evaluate with result caching."""
        # Generate cache key from model, data, and metrics
        cache_key = self._generate_cache_key(model, test_data, metrics)

        # Check cache
        cached_result = self.cache.get(cache_key)
        if cached_result:
            print("✅ Cache hit - evaluation cost saved!")
            return cached_result

        # Run evaluation
        print("🔄 Cache miss - running evaluation...")
        result = self.evaluator.evaluate(
            model=model,
            test_data=test_data,
            metrics=metrics
        )

        # Cache the result
        self.cache.set(cache_key, result)

        return result

    def _generate_cache_key(
        self,
        model: str,
        test_data: dict,
        metrics: List[str]
    ) -> str:
        """Generate unique cache key."""
        # Hash test data to detect changes
        data_hash = hashlib.md5(
            str(test_data).encode()
        ).hexdigest()

        # Combine all components
        key_components = f"{model}:{data_hash}:{','.join(sorted(metrics))}"

        return hashlib.md5(key_components.encode()).hexdigest()

# Usage
evaluator = CachedEvaluator()

# First run: Cache miss, costs $5.00
results1 = evaluator.evaluate(
    model="claude-3-5-sonnet-20241022",
    test_data=test_dataset,
    metrics=["accuracy", "coherence"]
)

# Second run: Cache hit, costs $0.00
results2 = evaluator.evaluate(
    model="claude-3-5-sonnet-20241022",
    test_data=test_dataset,
    metrics=["accuracy", "coherence"]
)

# Adding a metric changes the key above, so this run is a cache miss.
# Keying the cache per metric instead would reuse the cached accuracy and
# coherence scores and pay only for the new safety evaluation (~$1.50 vs $5.00).
results3 = evaluator.evaluate(
    model="claude-3-5-sonnet-20241022",
    test_data=test_dataset,
    metrics=["accuracy", "coherence", "safety"]  # +safety (new)
)
```
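
`EvaluationCache` is imported from a local `eval_cache` module that is not shown here. A minimal sketch of one way it could work, assuming results are JSON-serializable and stored as one file per key with a TTL check (a real store such as Redis would suit larger deployments):

```python
import json
import time
from pathlib import Path

class EvaluationCache:
    """File-backed cache with a time-to-live, keyed by an opaque string."""

    def __init__(self, ttl_days: int = 30, cache_dir: str = ".eval_cache"):
        self.ttl_seconds = ttl_days * 24 * 3600
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)

    def _path(self, key: str) -> Path:
        return self.cache_dir / f"{key}.json"

    def get(self, key: str):
        path = self._path(key)
        if not path.exists():
            return None
        entry = json.loads(path.read_text())
        if time.time() - entry["stored_at"] > self.ttl_seconds:
            path.unlink()  # expired entry
            return None
        return entry["value"]

    def set(self, key: str, value) -> None:
        # Assumes the evaluation result is JSON-serializable
        entry = {"stored_at": time.time(), "value": value}
        self._path(key).write_text(json.dumps(entry))
```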

## 🔒 Security Best Practices Examples

### Adversarial Testing Suite
```python
from adversarial_tester import AdversarialTester

class SecurityError(Exception):
    """Raised when a model fails a security quality gate."""

class SecurityEvaluator:
    def __init__(self):
        self.adversarial_tester = AdversarialTester()

    def evaluate_security(self, model: str):
        """Comprehensive security evaluation."""
        results = {}

        # 1. Prompt injection testing
        injection_tests = self.adversarial_tester.test_prompt_injection(
            model=model,
            attack_patterns=[
                "ignore_previous",
                "role_switch",
                "payload_splitting",
                "virtualization"
            ]
        )
        results["prompt_injection_resistance"] = injection_tests.block_rate

        # 2. Jailbreaking attempts
        jailbreak_tests = self.adversarial_tester.test_jailbreaking(
            model=model,
            techniques=[
                "do_anything_now",
                "character_roleplay",
                "hypothetical_scenario"
            ]
        )
        results["jailbreak_resistance"] = jailbreak_tests.block_rate

        # 3. PII leakage testing
        pii_tests = self.adversarial_tester.test_pii_leakage(
            model=model,
            pii_types=["ssn", "credit_card", "medical_record"]
        )
        results["pii_protection"] = 1 - pii_tests.leakage_rate

        # 4. Toxic content generation
        toxicity_tests = self.adversarial_tester.test_toxic_generation(
            model=model,
            categories=["hate", "violence", "sexual", "harassment"]
        )
        results["safety_filter_effectiveness"] = toxicity_tests.block_rate

        return results

# Usage
security_eval = SecurityEvaluator()

security_scores = security_eval.evaluate_security(
    model="claude-3-5-sonnet-20241022"
)

print("\n🔒 Security Evaluation Results:")
for metric, score in security_scores.items():
    status = "✅ PASS" if score > 0.95 else "⚠️ REVIEW"
    print(f"  {metric}: {score:.2%} {status}")

# Fail deployment if security scores are too low
if security_scores["prompt_injection_resistance"] < 0.90:
    raise SecurityError("Insufficient prompt injection protection")
```

### Safety Benchmark Suite
```python
from safety_evaluator import SafetyEvaluator

class ComprehensiveSafetyEval:
    def __init__(self):
        self.safety_eval = SafetyEvaluator()

    def run_safety_benchmarks(self, model: str):
        """Run comprehensive safety evaluation."""
        results = {}

        # 1. Bias evaluation across demographics
        bias_results = self.safety_eval.evaluate_bias(
            model=model,
            dimensions=[
                "gender",
                "race",
                "religion",
                "age",
                "nationality"
            ],
            test_scenarios=1000
        )
        results["bias_scores"] = bias_results

        # 2. Toxicity evaluation
        toxicity_results = self.safety_eval.evaluate_toxicity(
            model=model,
            categories=[
                "severe_toxicity",
                "obscene",
                "threat",
                "insult"
            ]
        )
        results["toxicity_scores"] = toxicity_results

        # 3. Truthfulness evaluation
        truthfulness_results = self.safety_eval.evaluate_truthfulness(
            model=model,
            benchmark="truthfulqa",
            categories=["health", "finance", "law"]
        )
        results["truthfulness_score"] = truthfulness_results.accuracy

        # 4. Compliance testing
        compliance_results = self.safety_eval.evaluate_compliance(
            model=model,
            standards=["hipaa", "gdpr", "financial_advice"]
        )
        results["compliance_scores"] = compliance_results

        return results

    def generate_safety_report(self, results: dict):
        """Generate comprehensive safety report."""
        report = "# Safety Evaluation Report\n\n"

        # Bias scores
        report += "## Bias Evaluation\n"
        for dimension, score in results["bias_scores"].items():
            report += f"- {dimension}: {score:.3f}\n"

        # Toxicity scores
        report += "\n## Toxicity Evaluation\n"
        for category, score in results["toxicity_scores"].items():
            report += f"- {category}: {score:.3f}\n"

        # Truthfulness
        report += "\n## Truthfulness\n"
        report += f"- Score: {results['truthfulness_score']:.3f}\n"

        # Compliance
        report += "\n## Compliance\n"
        for standard, passed in results["compliance_scores"].items():
            status = "✅ PASS" if passed else "❌ FAIL"
            report += f"- {standard}: {status}\n"

        return report

# Usage
safety_eval = ComprehensiveSafetyEval()

results = safety_eval.run_safety_benchmarks(
    model="claude-3-5-sonnet-20241022"
)

report = safety_eval.generate_safety_report(results)
print(report)

# Save report for compliance
with open("safety_evaluation_report.md", "w") as f:
    f.write(report)
```

## 📊 Enhanced Metrics & Monitoring

| Metric Category | Metric | Target | Tool |
|-----------------|--------|--------|------|
| **Task Performance** | Accuracy | >0.90 | Custom evaluator |
| | F1 Score | >0.85 | scikit-learn |
| | BLEU score (generation) | >0.40 | nltk |
| | ROUGE-L (summarization) | >0.45 | rouge-score |
| **Benchmark Scores** | MMLU (reasoning) | >80% | Benchmark runner |
| | HellaSwag (common sense) | >85% | Benchmark runner |
| | TruthfulQA (truthfulness) | >75% | Benchmark runner |
| **Quality Metrics** | Coherence score | >0.85 | Custom evaluator |
| | Relevance score | >0.90 | Custom evaluator |
| | Hallucination rate | <5% | Factuality checker |
| **Safety & Bias** | Toxicity score | <0.02 | Perspective API |
| | Bias score (demographics) | <0.10 | Fairness evaluator |
| | Safety filter pass rate | >0.98 | Safety evaluator |
| **Costs** | Evaluation cost per model | <$50 | Cost tracker |
| | Cost per test sample | <$0.05 | Cost analyzer |
| | Cache hit rate | >50% | Eval cache |
| **Performance** | Evaluation runtime (1K samples) | <30min | Time tracker |
| | Throughput (samples/sec) | >5 | Benchmark runner |

## 🚀 Deployment Pipeline

### CI/CD with Automated Evaluation
```yaml
# .github/workflows/llm-evaluation-pipeline.yml
name: LLM Evaluation Pipeline

on:
  pull_request:
    paths:
      - 'models/**'
      - 'prompts/**'
  push:
    branches:
      - main

jobs:
  smoke-test:
    runs-on: ubuntu-latest
    steps:
      - name: Quick smoke test (50 samples)
        run: python scripts/run_smoke_test.py --samples 50

      - name: Check basic accuracy
        run: |
          python scripts/check_accuracy.py --min-threshold 0.80

  standard-evaluation:
    needs: smoke-test
    runs-on: ubuntu-latest
    steps:
      - name: Run standard evaluation (200 samples)
        run: |
          python scripts/run_evaluation.py \
            --metrics accuracy,coherence,relevance \
            --samples 200

      - name: Check for regression
        run: |
          python scripts/check_regression.py \
            --baseline-version v1.0 \
            --max-degradation 0.05

      - name: Generate evaluation report
        run: python scripts/generate_eval_report.py

      - name: Upload results to MLflow
        run: python scripts/upload_to_mlflow.py

  comprehensive-evaluation:
    needs: standard-evaluation
    if: github.event_name == 'push' && github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    steps:
      - name: Run comprehensive evaluation (1000 samples)
        run: |
          python scripts/run_comprehensive_eval.py \
            --metrics all \
            --samples 1000

      - name: Run benchmark suite
        run: python scripts/run_benchmarks.py --benchmarks mmlu,hellaswag,truthfulqa

      - name: Security evaluation
        run: python scripts/run_security_eval.py

      - name: Generate final report
        run: python scripts/generate_final_report.py

      - name: Quality gate check
        run: |
          python scripts/quality_gate.py \
            --min-accuracy 0.90 \
            --max-toxicity 0.02 \
            --min-safety 0.98

  deploy-if-passing:
    needs: comprehensive-evaluation
    runs-on: ubuntu-latest
    environment: production
    steps:
      - name: Deploy to production
        if: success()
        run: python scripts/deploy_model.py --environment production

      - name: Monitor post-deployment
        run: python scripts/monitor_production.py --duration 2h
```
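
The workflow above calls several `scripts/*.py` helpers that are not shown. A minimal sketch of what `scripts/quality_gate.py` could look like, assuming the comprehensive evaluation wrote its metrics to `results/comprehensive_eval.json` (the path and field names are assumptions):

```python
#!/usr/bin/env python
"""Exit non-zero when evaluation metrics miss the thresholds, failing the CI job."""
import argparse
import json
import sys

def main() -> int:
    parser = argparse.ArgumentParser()
    parser.add_argument("--results", default="results/comprehensive_eval.json")
    parser.add_argument("--min-accuracy", type=float, default=0.90)
    parser.add_argument("--max-toxicity", type=float, default=0.02)
    parser.add_argument("--min-safety", type=float, default=0.98)
    args = parser.parse_args()

    with open(args.results) as f:
        metrics = json.load(f)

    checks = [
        ("accuracy", metrics["accuracy"] >= args.min_accuracy),
        ("toxicity", metrics["toxicity"] <= args.max_toxicity),
        ("safety", metrics["safety"] >= args.min_safety),
    ]

    failed = [name for name, ok in checks if not ok]
    for name, ok in checks:
        print(f"{'PASS' if ok else 'FAIL'}: {name} = {metrics[name]}")

    if failed:
        print(f"Quality gate failed: {', '.join(failed)}", file=sys.stderr)
        return 1
    return 0

if __name__ == "__main__":
    sys.exit(main())
```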

## 🔄 Integration Workflow

### End-to-End Evaluation Pipeline with All Roles
```
1. Code/Model Change Committed
   ↓
2. Trigger CI/CD Pipeline (do-01)
   ↓
3. Load Test Dataset (de-01)
   ↓
4. Validate Test Data Quality (de-03)
   ↓
5. Run Smoke Test (50 samples) (ai-06)
   ↓
6. Basic Regression Check (mo-05)
   ↓
7. Standard Evaluation (200 samples) (ai-06)
   ↓
8. Cost Tracking (fo-01)
   ↓
9. Security Evaluation (sa-08)
   ↓
10. Bias & Safety Testing (ds-01)
    ↓
11. Statistical Significance Test (ds-08)
    ↓
12. Quality Gate Check (mo-04)
    ↓
13. Comprehensive Evaluation (if passing) (ai-06)
    ↓
14. Benchmark Suite Execution (ai-06)
    ↓
15. Generate Evaluation Report (ai-06)
    ↓
16. Upload Metrics to MLflow (mo-01)
    ↓
17. Final Quality Gate (mo-04)
    ↓
18. Deploy if All Checks Pass (do-01)
    ↓
19. Post-Deployment Monitoring (mo-04)
    ↓
20. Continuous Evaluation in Production (mo-04)
```

## 🎯 Quick Wins

1. **Implement smoke tests** - Catch major regressions quickly with 50-sample tests
2. **Use statistical sampling** - Roughly 75-80% cost reduction with valid confidence intervals
3. **Cache evaluation results** - Reuse evaluations across multiple metrics
4. **Set up automated regression testing** - Block deployments on quality degradation
5. **Run tiered evaluations** - Quick tests for minor changes, comprehensive for major
6. **Add security evaluation** - Test against adversarial attacks before production
7. **Track evaluation costs** - Monitor and optimize evaluation budget
8. **Use distributed evaluation** - Parallelize for 10x faster benchmark execution