@patricio0312rev/skillset 0.1.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/CHANGELOG.md +29 -0
- package/LICENSE +21 -0
- package/README.md +176 -0
- package/bin/cli.js +37 -0
- package/package.json +55 -0
- package/src/commands/init.js +301 -0
- package/src/index.js +168 -0
- package/src/lib/config.js +200 -0
- package/src/lib/generator.js +166 -0
- package/src/utils/display.js +95 -0
- package/src/utils/readme.js +196 -0
- package/src/utils/tool-specific.js +233 -0
- package/templates/ai-engineering/agent-orchestration-planner/SKILL.md +266 -0
- package/templates/ai-engineering/cost-latency-optimizer/SKILL.md +270 -0
- package/templates/ai-engineering/doc-to-vector-dataset-generator/SKILL.md +239 -0
- package/templates/ai-engineering/evaluation-harness/SKILL.md +219 -0
- package/templates/ai-engineering/guardrails-safety-filter-builder/SKILL.md +226 -0
- package/templates/ai-engineering/llm-debugger/SKILL.md +283 -0
- package/templates/ai-engineering/prompt-regression-tester/SKILL.md +216 -0
- package/templates/ai-engineering/prompt-template-builder/SKILL.md +393 -0
- package/templates/ai-engineering/rag-pipeline-builder/SKILL.md +244 -0
- package/templates/ai-engineering/tool-function-schema-designer/SKILL.md +219 -0
- package/templates/architecture/adr-writer/SKILL.md +250 -0
- package/templates/architecture/api-versioning-deprecation-planner/SKILL.md +331 -0
- package/templates/architecture/domain-model-boundaries-mapper/SKILL.md +300 -0
- package/templates/architecture/migration-planner/SKILL.md +376 -0
- package/templates/architecture/performance-budget-setter/SKILL.md +318 -0
- package/templates/architecture/reliability-strategy-builder/SKILL.md +286 -0
- package/templates/architecture/rfc-generator/SKILL.md +362 -0
- package/templates/architecture/scalability-playbook/SKILL.md +279 -0
- package/templates/architecture/system-design-generator/SKILL.md +339 -0
- package/templates/architecture/tech-debt-prioritizer/SKILL.md +329 -0
- package/templates/backend/api-contract-normalizer/SKILL.md +487 -0
- package/templates/backend/api-endpoint-generator/SKILL.md +415 -0
- package/templates/backend/auth-module-builder/SKILL.md +99 -0
- package/templates/backend/background-jobs-designer/SKILL.md +166 -0
- package/templates/backend/caching-strategist/SKILL.md +190 -0
- package/templates/backend/error-handling-standardizer/SKILL.md +174 -0
- package/templates/backend/rate-limiting-abuse-protection/SKILL.md +147 -0
- package/templates/backend/rbac-permissions-builder/SKILL.md +158 -0
- package/templates/backend/service-layer-extractor/SKILL.md +269 -0
- package/templates/backend/webhook-receiver-hardener/SKILL.md +211 -0
- package/templates/ci-cd/artifact-sbom-publisher/SKILL.md +236 -0
- package/templates/ci-cd/caching-strategy-optimizer/SKILL.md +195 -0
- package/templates/ci-cd/deployment-checklist-generator/SKILL.md +381 -0
- package/templates/ci-cd/github-actions-pipeline-creator/SKILL.md +348 -0
- package/templates/ci-cd/monorepo-ci-optimizer/SKILL.md +298 -0
- package/templates/ci-cd/preview-environments-builder/SKILL.md +187 -0
- package/templates/ci-cd/quality-gates-enforcer/SKILL.md +342 -0
- package/templates/ci-cd/release-automation-builder/SKILL.md +281 -0
- package/templates/ci-cd/rollback-workflow-builder/SKILL.md +372 -0
- package/templates/ci-cd/secrets-env-manager/SKILL.md +242 -0
- package/templates/db-management/backup-restore-runbook-generator/SKILL.md +505 -0
- package/templates/db-management/data-integrity-auditor/SKILL.md +505 -0
- package/templates/db-management/data-retention-archiving-planner/SKILL.md +430 -0
- package/templates/db-management/data-seeding-fixtures-builder/SKILL.md +375 -0
- package/templates/db-management/db-performance-watchlist/SKILL.md +425 -0
- package/templates/db-management/etl-sync-job-builder/SKILL.md +457 -0
- package/templates/db-management/multi-tenant-safety-checker/SKILL.md +398 -0
- package/templates/db-management/prisma-migration-assistant/SKILL.md +379 -0
- package/templates/db-management/schema-consistency-checker/SKILL.md +440 -0
- package/templates/db-management/sql-query-optimizer/SKILL.md +324 -0
- package/templates/foundation/changelog-writer/SKILL.md +431 -0
- package/templates/foundation/code-formatter-installer/SKILL.md +320 -0
- package/templates/foundation/codebase-summarizer/SKILL.md +360 -0
- package/templates/foundation/dependency-doctor/SKILL.md +163 -0
- package/templates/foundation/dev-environment-bootstrapper/SKILL.md +259 -0
- package/templates/foundation/dev-onboarding-builder/SKILL.md +556 -0
- package/templates/foundation/docs-starter-kit/SKILL.md +574 -0
- package/templates/foundation/explaining-code/SKILL.md +13 -0
- package/templates/foundation/git-hygiene-enforcer/SKILL.md +455 -0
- package/templates/foundation/project-scaffolder/SKILL.md +65 -0
- package/templates/foundation/project-scaffolder/references/templates.md +126 -0
- package/templates/foundation/repo-structure-linter/SKILL.md +0 -0
- package/templates/foundation/repo-structure-linter/references/conventions.md +98 -0
- package/templates/frontend/animation-micro-interaction-pack/SKILL.md +41 -0
- package/templates/frontend/component-scaffold-generator/SKILL.md +562 -0
- package/templates/frontend/design-to-component-translator/SKILL.md +547 -0
- package/templates/frontend/form-wizard-builder/SKILL.md +553 -0
- package/templates/frontend/frontend-refactor-planner/SKILL.md +37 -0
- package/templates/frontend/i18n-frontend-implementer/SKILL.md +44 -0
- package/templates/frontend/modal-drawer-system/SKILL.md +377 -0
- package/templates/frontend/page-layout-builder/SKILL.md +630 -0
- package/templates/frontend/state-ux-flow-builder/SKILL.md +23 -0
- package/templates/frontend/table-builder/SKILL.md +350 -0
- package/templates/performance/alerting-dashboard-builder/SKILL.md +162 -0
- package/templates/performance/backend-latency-profiler-helper/SKILL.md +108 -0
- package/templates/performance/caching-cdn-strategy-planner/SKILL.md +150 -0
- package/templates/performance/capacity-planning-helper/SKILL.md +242 -0
- package/templates/performance/core-web-vitals-tuner/SKILL.md +126 -0
- package/templates/performance/incident-runbook-generator/SKILL.md +162 -0
- package/templates/performance/load-test-scenario-builder/SKILL.md +256 -0
- package/templates/performance/observability-setup/SKILL.md +232 -0
- package/templates/performance/postmortem-writer/SKILL.md +203 -0
- package/templates/performance/structured-logging-standardizer/SKILL.md +122 -0
- package/templates/security/auth-security-reviewer/SKILL.md +428 -0
- package/templates/security/dependency-vulnerability-triage/SKILL.md +495 -0
- package/templates/security/input-validation-sanitization-auditor/SKILL.md +76 -0
- package/templates/security/pii-redaction-logging-policy-builder/SKILL.md +65 -0
- package/templates/security/rbac-policy-tester/SKILL.md +80 -0
- package/templates/security/secrets-scanner/SKILL.md +462 -0
- package/templates/security/secure-headers-csp-builder/SKILL.md +404 -0
- package/templates/security/security-incident-playbook-generator/SKILL.md +76 -0
- package/templates/security/security-pr-checklist-skill/SKILL.md +62 -0
- package/templates/security/threat-model-generator/SKILL.md +394 -0
- package/templates/testing/contract-testing-builder/SKILL.md +492 -0
- package/templates/testing/coverage-strategist/SKILL.md +436 -0
- package/templates/testing/e2e-test-builder/SKILL.md +382 -0
- package/templates/testing/flaky-test-detective/SKILL.md +416 -0
- package/templates/testing/integration-test-builder/SKILL.md +525 -0
- package/templates/testing/mocking-assistant/SKILL.md +383 -0
- package/templates/testing/snapshot-test-refactorer/SKILL.md +375 -0
- package/templates/testing/test-data-factory-builder/SKILL.md +449 -0
- package/templates/testing/test-reporting-triage-skill/SKILL.md +469 -0
- package/templates/testing/unit-test-generator/SKILL.md +548 -0
@@ -0,0 +1,283 @@
---
name: llm-debugger
description: Diagnoses LLM output failures including hallucinations, constraint violations, format errors, and reasoning issues. Provides root cause classification, prompt fixes, tool improvements, and new test cases. Use for "debugging AI", "fixing prompts", "quality issues", or "output errors".
---

# LLM Debugger

Systematically diagnose and fix LLM output issues.

## Failure Taxonomy

```python
from enum import Enum

class FailureType(Enum):
    HALLUCINATION = "hallucination"
    FORMAT_VIOLATION = "format_violation"
    CONSTRAINT_BREAK = "constraint_break"
    REASONING_ERROR = "reasoning_error"
    TOOL_MISUSE = "tool_misuse"
    REFUSAL = "unexpected_refusal"
    INCOMPLETE = "incomplete_output"
```

## Root Cause Analysis

```python
import json
import re

def diagnose_failure(input_text: str, output: str, expected: dict) -> dict:
    """Identify why LLM output failed"""
    issues = []

    # Check format
    if expected.get("format") == "json":
        try:
            json.loads(output)
        except json.JSONDecodeError:
            issues.append({
                "type": FailureType.FORMAT_VIOLATION,
                "details": "Invalid JSON output"
            })

    # Check required fields (substring check on the raw output text)
    if expected.get("required_fields"):
        for field in expected["required_fields"]:
            if field not in output:
                issues.append({
                    "type": FailureType.INCOMPLETE,
                    "details": f"Missing required field: {field}"
                })

    # Check constraints
    if expected.get("max_length"):
        if len(output) > expected["max_length"]:
            issues.append({
                "type": FailureType.CONSTRAINT_BREAK,
                "details": f"Output too long: {len(output)} > {expected['max_length']}"
            })

    # Check for hallucination indicators
    if contains_hallucination_markers(output):
        issues.append({
            "type": FailureType.HALLUCINATION,
            "details": "Contains fabricated information"
        })

    return {
        "has_issues": len(issues) > 0,
        "issues": issues,
        "primary_issue": issues[0] if issues else None
    }

def contains_hallucination_markers(output: str) -> bool:
    """Detect common hallucination patterns"""
    markers = [
        r"According to.*that doesn't exist",
        r'In \d{4}.*before that year',
        r'contradicts itself',
    ]
    return any(re.search(marker, output, re.IGNORECASE) for marker in markers)
```

## Prompt Fixes

````python
PROMPT_FIXES = {
    FailureType.FORMAT_VIOLATION: """
Add strict format instructions:
"Output MUST be valid JSON with this exact structure:
```json
{{"field1": "value", "field2": 123}}
```

Do not add any text before or after the JSON."
""",

    FailureType.HALLUCINATION: """
Add grounding instructions:
"Base your response ONLY on the provided context.
If information is not in the context, say 'I don't have that information.'
Never make up facts or details."
""",

    FailureType.CONSTRAINT_BREAK: """
Add explicit constraints:
"Your response must be:

- Maximum 200 words
- No code examples
- Professional tone only"
""",

    FailureType.REASONING_ERROR: """
Add step-by-step reasoning:
"Think through this step by step:

1. First, identify...
2. Then, consider...
3. Finally, conclude..."
""",
}

def suggest_prompt_fix(diagnosis: dict, current_prompt: str) -> str:
    """Generate improved prompt based on diagnosis"""
    primary_issue = diagnosis["primary_issue"]
    if not primary_issue:
        return current_prompt

    # Not every FailureType has a canned fix; fall back to the original prompt
    fix = PROMPT_FIXES.get(primary_issue["type"])
    if fix is None:
        return current_prompt

    return f"""{current_prompt}

IMPORTANT INSTRUCTIONS:
{fix}
"""
````

## Tool Improvements

```python
def suggest_tool_fixes(diagnosis: dict, tool_schema: dict) -> dict:
    """Improve tool schema to prevent misuse"""
    fixes = {}

    for issue in diagnosis["issues"]:
        if issue["type"] == FailureType.TOOL_MISUSE:
            # Make required parameters more explicit
            fixes["parameters"] = {
                **tool_schema["parameters"],
                "description": "REQUIRED: " + tool_schema["parameters"].get("description", "")
            }

            # Add examples
            fixes["examples"] = [
                {
                    "params": {"query": "example search"},
                    "description": "Use for finding information"
                }
            ]

    return {**tool_schema, **fixes}
```

## Test Case Generation

```python
from typing import List

def generate_test_cases(diagnosis: dict, failed_input: str, failed_output: str) -> List[dict]:
    """Create regression test cases from failures"""
    test_cases = []

    # Test case for the specific failure
    test_cases.append({
        "input": failed_input,
        "expected_behavior": "Should not " + diagnosis["primary_issue"]["details"],
        "validation": lambda output: not has_same_issue(output, diagnosis),
    })

    # Edge cases based on failure type
    if diagnosis["primary_issue"]["type"] == FailureType.FORMAT_VIOLATION:
        test_cases.append({
            "input": failed_input,
            "validation": lambda output: is_valid_json(output),
            "description": "Output must be valid JSON"
        })

    # Similar inputs that might trigger the same issue
    similar_inputs = generate_similar_inputs(failed_input)
    for inp in similar_inputs:
        test_cases.append({
            "input": inp,
            "validation": lambda output: not has_same_issue(output, diagnosis),
        })

    return test_cases
```

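The `is_valid_json` validator referenced in the test cases above is left to the implementer; a minimal stdlib-only version could look like this:

```python
import json

def is_valid_json(text: str) -> bool:
    """True if the text parses as JSON, False otherwise."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

print(is_valid_json('{"a": 1}'), is_valid_json("not json"))  # → True False
```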
## Debugging Workflow

```python
def debug_llm_output(
    prompt: str,
    output: str,
    expected: dict,
    context: dict | None = None
) -> dict:
    """Complete debugging workflow"""
    context = context or {}

    # 1. Diagnose issue
    diagnosis = diagnose_failure(prompt, output, expected)

    if not diagnosis["has_issues"]:
        return {"status": "ok", "diagnosis": diagnosis}

    # 2. Generate fixes
    fixed_prompt = suggest_prompt_fix(diagnosis, prompt)

    # 3. Create test cases
    test_cases = generate_test_cases(diagnosis, prompt, output)

    # 4. Test fix
    test_output = llm(fixed_prompt)
    fix_works = not diagnose_failure(fixed_prompt, test_output, expected)["has_issues"]

    return {
        "status": "issues_found",
        "diagnosis": diagnosis,
        "fixes": {
            "prompt": fixed_prompt,
            "fix_verified": fix_works,
        },
        "test_cases": test_cases,
        "recommendations": generate_recommendations(diagnosis, context)
    }
```

## Interactive Debugging

```python
def interactive_debug():
    """Interactive debugging session"""
    print("LLM Debugger - Interactive Mode")

    prompt = input("Enter prompt: ")
    expected_output = input("Expected output format (json/text): ")

    output = llm(prompt)
    print(f"\nGenerated output:\n{output}\n")

    if input("Does this look correct? (y/n): ").lower() == 'n':
        diagnosis = diagnose_failure(prompt, output, {"format": expected_output})

        print("\nDiagnosis:")
        for issue in diagnosis["issues"]:
            print(f"- {issue['type'].value}: {issue['details']}")

        print("\nSuggested fix:")
        print(suggest_prompt_fix(diagnosis, prompt))
```

## Best Practices

1. **Reproduce consistently**: Run the failing input multiple times before concluding anything
2. **Isolate variables**: Test prompt changes one at a time
3. **Check examples**: Are few-shot examples clear?
4. **Validate constraints**: Are they explicit enough?
5. **Test edge cases**: Probe boundary conditions
6. **Log everything**: Inputs, outputs, and diagnosed issues
7. **Iterate systematically**: Track which fixes work

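The first practice, reproducing the failure across multiple runs, can be sketched as a small helper; the `llm` callable here is a stand-in for whatever model client you use:

```python
from collections import Counter

def reproducibility_check(llm, prompt: str, runs: int = 5) -> dict:
    """Run the same prompt several times and report how often each output recurs."""
    outputs = [llm(prompt) for _ in range(runs)]
    counts = Counter(outputs)
    majority_output, frequency = counts.most_common(1)[0]
    return {
        "runs": runs,
        "distinct_outputs": len(counts),
        "agreement_rate": frequency / runs,  # 1.0 means fully consistent
        "majority_output": majority_output,
    }

# Example with a deterministic stand-in model
report = reproducibility_check(lambda p: p.upper(), "hello", runs=3)
print(report["agreement_rate"])  # → 1.0
```

An agreement rate well below 1.0 means the failure is probabilistic, so any fix should be verified over several runs, not one.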
## Output Checklist

- [ ] Failure taxonomy defined
- [ ] Root cause analyzer
- [ ] Prompt fix suggestions
- [ ] Tool schema improvements
- [ ] Test case generator
- [ ] Fix verification
- [ ] Debugging workflow
- [ ] Recommendations engine
- [ ] Interactive debugger
- [ ] Logging system

@@ -0,0 +1,216 @@
---
name: prompt-regression-tester
description: Compares old vs new prompts across test cases with diff summaries, stability metrics, breakage analysis, and fix suggestions. Use for "prompt testing", "A/B testing prompts", "prompt versioning", or "quality regression".
---

# Prompt Regression Tester

Systematically test prompt changes to prevent regressions.

## Test Case Format

```json
{
  "test_cases": [
    {
      "id": "test_001",
      "input": "Summarize this article",
      "context": "Article text here...",
      "expected_behavior": "Concise 2-3 sentence summary",
      "baseline_output": "Output from v1.0 prompt",
      "must_include": ["main point", "conclusion"],
      "must_not_include": ["opinion", "speculation"]
    }
  ]
}
```

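A lightweight loader can reject malformed test cases before a run starts; in this sketch the set of required keys is an assumption based on the format above:

```python
import json

REQUIRED_KEYS = {"id", "input"}  # minimal keys every test case must carry

def load_test_cases(raw_json: str) -> list:
    """Parse a test-case file and reject entries missing required keys."""
    data = json.loads(raw_json)
    cases = data.get("test_cases", [])
    for case in cases:
        missing = REQUIRED_KEYS - case.keys()
        if missing:
            raise ValueError(f"Test case {case.get('id', '?')} missing keys: {sorted(missing)}")
    return cases

cases = load_test_cases('{"test_cases": [{"id": "test_001", "input": "Summarize this article"}]}')
print(len(cases))  # → 1
```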
## Comparison Framework

```python
def compare_prompts(old_prompt, new_prompt, test_cases):
    results = {
        "test_cases": [],
        "summary": {
            "total": len(test_cases),
            "improvements": 0,
            "regressions": 0,
            "unchanged": 0,
        },
        "breakages": []
    }

    for test in test_cases:
        old_output = llm(old_prompt.format(**test))
        new_output = llm(new_prompt.format(**test))

        comparison = {
            "test_id": test["id"],
            "old_output": old_output,
            "new_output": new_output,
            "diff": compute_diff(old_output, new_output),
            "scores": {
                "old": score_output(old_output, test),
                "new": score_output(new_output, test),
            },
            # classify_change returns "improvements", "regressions", or "unchanged",
            # matching the summary keys above
            "verdict": classify_change(old_output, new_output, test)
        }

        results["test_cases"].append(comparison)
        results["summary"][comparison["verdict"]] += 1

        if comparison["verdict"] == "regressions":
            results["breakages"].append(analyze_breakage(comparison, test))

    return results
```

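`compute_diff` is left open above; a simple line-level implementation using the standard library's `difflib` might look like this:

```python
import difflib

def compute_diff(old_output: str, new_output: str) -> str:
    """Unified diff of two outputs, line by line."""
    diff = difflib.unified_diff(
        old_output.splitlines(),
        new_output.splitlines(),
        fromfile="old_prompt",
        tofile="new_prompt",
        lineterm="",
    )
    return "\n".join(diff)

print(compute_diff("The cost is low.", "The cost is high."))
```

For prose outputs, a word-level diff (e.g. `difflib.ndiff` over tokens) often reads better in reports than a line diff.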
## Stability Metrics

```python
def calculate_stability_metrics(results):
    return {
        "output_stability": measure_output_consistency(results),
        "format_stability": check_format_preservation(results),
        "constraint_adherence": check_constraints(results),
        "behavioral_consistency": measure_behavior_delta(results),
    }

def measure_output_consistency(results):
    """How similar are outputs between versions?"""
    similarities = []
    for result in results["test_cases"]:
        sim = semantic_similarity(
            result["old_output"],
            result["new_output"]
        )
        similarities.append(sim)
    return sum(similarities) / len(similarities)
```

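`semantic_similarity` is left to the implementer (embedding cosine similarity is a common choice); a dependency-free stand-in for quick experiments is token-set Jaccard overlap:

```python
def jaccard_similarity(a: str, b: str) -> float:
    """Crude lexical similarity: shared tokens / total distinct tokens."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty outputs are trivially identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

print(jaccard_similarity("the cost is low", "the cost is high"))  # → 0.6
```

Lexical overlap misses paraphrases, so treat it only as a cheap first pass before a proper semantic metric.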
## Breakage Analysis

```python
def analyze_breakage(comparison, test_case):
    """Identify why the new prompt failed"""
    reasons = []

    new_out = comparison["new_output"]

    # Missing required content
    for keyword in test_case.get("must_include", []):
        if keyword.lower() not in new_out.lower():
            reasons.append(f"Missing required content: '{keyword}'")

    # Contains forbidden content
    for keyword in test_case.get("must_not_include", []):
        if keyword.lower() in new_out.lower():
            reasons.append(f"Contains forbidden content: '{keyword}'")

    # Format violations
    if not check_format(new_out, test_case.get("expected_format")):
        reasons.append("Output format violation")

    # Length issues (flag deviations beyond 30% of the target word count)
    expected_length = test_case.get("expected_length")
    if expected_length:
        actual_length = len(new_out.split())
        if abs(actual_length - expected_length) > expected_length * 0.3:
            reasons.append(f"Length deviation: expected ~{expected_length}, got {actual_length}")

    return {
        "test_id": test_case["id"],
        "reasons": reasons,
        "old_output": comparison["old_output"][:100],
        "new_output": comparison["new_output"][:100],
    }
```

## Fix Suggestions

```python
def suggest_fixes(breakages):
    """Generate fix suggestions based on breakage patterns"""
    suggestions = []

    # Group breakages by reason
    reason_groups = {}
    for breakage in breakages:
        for reason in breakage["reasons"]:
            reason_groups.setdefault(reason, []).append(breakage["test_id"])

    # Generate suggestions
    for reason, test_ids in reason_groups.items():
        if "Missing required content" in reason:
            suggestions.append({
                "issue": reason,
                "affected_tests": test_ids,
                "suggestion": "Add explicit instruction in prompt to include this content",
                "example": f"Make sure to mention {reason.split(':')[1].strip()} in your response."
            })
        elif "format violation" in reason:
            suggestions.append({
                "issue": reason,
                "affected_tests": test_ids,
                "suggestion": "Add stricter format constraints to prompt",
                "example": "Output must follow this exact format: ..."
            })

    return suggestions
```

## Report Generation

```markdown
# Prompt Regression Report

## Summary

- **Total tests:** 50
- **Improvements:** 5 (10%)
- **Regressions:** 3 (6%)
- **Unchanged:** 42 (84%)

## Stability Metrics

- **Output stability:** 0.87
- **Format stability:** 0.95
- **Constraint adherence:** 0.94

## Regressions (3)

### test_005: Missing required content

**Old output:** "The main benefit is cost savings..."
**New output:** "This approach provides flexibility..."
**Issue:** Missing required keyword 'cost'
**Fix:** Add explicit instruction: "Mention cost implications in your response"

## Recommended Actions

1. Revert changes that caused regressions (tests: 005, 012, 023)
2. Add format constraints for JSON output
3. Run full test suite before deployment
```

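The Summary section of such a report can be rendered directly from the summary dict built by the comparison framework; a small sketch (the dict shape is assumed to match `compare_prompts` above):

```python
def render_summary(summary: dict) -> str:
    """Render the report's Summary section from a compare_prompts-style summary dict."""
    total = summary["total"]
    lines = ["## Summary", "", f"- **Total tests:** {total}"]
    for key, label in [("improvements", "Improvements"),
                       ("regressions", "Regressions"),
                       ("unchanged", "Unchanged")]:
        count = summary[key]
        pct = round(100 * count / total) if total else 0
        lines.append(f"- **{label}:** {count} ({pct}%)")
    return "\n".join(lines)

print(render_summary({"total": 50, "improvements": 5, "regressions": 3, "unchanged": 42}))
```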
## Best Practices

- Test with diverse inputs
- Compare across multiple runs (LLMs are stochastic)
- Track metrics over time
- Automate in CI/CD
- Review all regressions before deploying
- Maintain a test case library

## Output Checklist

- [ ] Test cases defined (30+)
- [ ] Comparison runner
- [ ] Stability metrics
- [ ] Breakage analyzer
- [ ] Fix suggestions
- [ ] Diff visualizer
- [ ] Automated report
- [ ] CI integration