llm-regtest 0.1.0__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- llm_regtest-0.1.0/PKG-INFO +620 -0
- llm_regtest-0.1.0/README.md +593 -0
- llm_regtest-0.1.0/pyproject.toml +30 -0
- llm_regtest-0.1.0/setup.cfg +4 -0
- llm_regtest-0.1.0/src/llm_regtest/__init__.py +3 -0
- llm_regtest-0.1.0/src/llm_regtest/__main__.py +5 -0
- llm_regtest-0.1.0/src/llm_regtest/analyzer.py +136 -0
- llm_regtest-0.1.0/src/llm_regtest/cli.py +421 -0
- llm_regtest-0.1.0/src/llm_regtest/config.py +136 -0
- llm_regtest-0.1.0/src/llm_regtest/loader.py +132 -0
- llm_regtest-0.1.0/src/llm_regtest/models/__init__.py +47 -0
- llm_regtest-0.1.0/src/llm_regtest/models/anthropic_model.py +85 -0
- llm_regtest-0.1.0/src/llm_regtest/models/base.py +53 -0
- llm_regtest-0.1.0/src/llm_regtest/models/openai_model.py +90 -0
- llm_regtest-0.1.0/src/llm_regtest/reporter.py +244 -0
- llm_regtest-0.1.0/src/llm_regtest/runner.py +125 -0
- llm_regtest-0.1.0/src/llm_regtest/scorer.py +189 -0
- llm_regtest-0.1.0/src/llm_regtest/tests/__init__.py +0 -0
- llm_regtest-0.1.0/src/llm_regtest/tests/conftest.py +57 -0
- llm_regtest-0.1.0/src/llm_regtest/tests/test_analyzer.py +98 -0
- llm_regtest-0.1.0/src/llm_regtest/tests/test_cli.py +83 -0
- llm_regtest-0.1.0/src/llm_regtest/tests/test_loader.py +83 -0
- llm_regtest-0.1.0/src/llm_regtest/tests/test_scorer.py +117 -0
- llm_regtest-0.1.0/src/llm_regtest/types.py +100 -0
- llm_regtest-0.1.0/src/llm_regtest.egg-info/PKG-INFO +620 -0
- llm_regtest-0.1.0/src/llm_regtest.egg-info/SOURCES.txt +28 -0
- llm_regtest-0.1.0/src/llm_regtest.egg-info/dependency_links.txt +1 -0
- llm_regtest-0.1.0/src/llm_regtest.egg-info/entry_points.txt +2 -0
- llm_regtest-0.1.0/src/llm_regtest.egg-info/requires.txt +24 -0
- llm_regtest-0.1.0/src/llm_regtest.egg-info/top_level.txt +1 -0
@@ -0,0 +1,620 @@
Metadata-Version: 2.4
Name: llm-regtest
Version: 0.1.0
Summary: Lightweight regression testing for LLM prompts
Author: Libin Samkutty
License: MIT
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Provides-Extra: openai
Requires-Dist: openai>=1.0; extra == "openai"
Provides-Extra: anthropic
Requires-Dist: anthropic>=0.18; extra == "anthropic"
Provides-Extra: yaml
Requires-Dist: pyyaml>=6.0; extra == "yaml"
Provides-Extra: semantic
Requires-Dist: sentence-transformers>=2.2; extra == "semantic"
Requires-Dist: numpy>=1.24; extra == "semantic"
Provides-Extra: all
Requires-Dist: openai>=1.0; extra == "all"
Requires-Dist: anthropic>=0.18; extra == "all"
Requires-Dist: pyyaml>=6.0; extra == "all"
Requires-Dist: sentence-transformers>=2.2; extra == "all"
Requires-Dist: numpy>=1.24; extra == "all"
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0; extra == "dev"

# llm-regtest

A lightweight regression testing framework for LLM prompts. Catch semantic drift before it ships.

---

## What does this tool do?

When you use an AI model, the exact wording of your instructions — called a **prompt** — has a big effect on the quality of responses. Small wording changes can cause unexpected regressions that are hard to spot without running tests.

This tool works like a test suite for prompts:

1. You define test cases with prompts and optional inputs
2. Run `update-baseline` — the AI's responses become your **golden dataset**
3. Later, after changing a prompt or upgrading a model, run `llm-regtest run`
4. The tool compares new responses to the baselines and flags anything that regressed

Each comparison produces a **score from 0.0 to 1.0** using one or more methods:

| Method | How it works | Best for |
|--------|-------------|----------|
| `exact` | Character-for-character match (0 or 1) | Classification labels, structured outputs |
| `fuzzy` | Levenshtein edit-distance ratio | General text where minor wording drift is acceptable |
| `semantic` | Cosine similarity of sentence embeddings (`all-MiniLM-L6-v2`) | Longer outputs where meaning matters more than wording |
| `llm_judge` | An LLM rates quality similarity 0.0–1.0 | Conversational outputs where both fuzzy and semantic are too strict |

Results are bucketed into **PASS / WARN / FAIL** based on configurable score thresholds.
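
A fuzzy score of this kind can be sketched with Python's standard library: `difflib.SequenceMatcher.ratio()` returns a 0.0–1.0 similarity in the same spirit as an edit-distance ratio (the package's own `fuzzy` scorer may be implemented differently):

```python
from difflib import SequenceMatcher

def fuzzy_score(output: str, baseline: str) -> float:
    """Return a 0.0-1.0 similarity ratio between two strings."""
    return SequenceMatcher(None, output, baseline).ratio()

# Identical strings score 1.0; minor drift scores high, full rewrites score low.
print(fuzzy_score("Short walks help.", "Short walks help."))  # 1.0
```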

---

## What you need before starting

- **Python 3.10 or newer** ([download here](https://www.python.org/downloads/))
- **pip** — comes with Python automatically
- **An API key** — for OpenAI ([get one here](https://platform.openai.com/api-keys)) or Anthropic ([get one here](https://console.anthropic.com/))

To check your Python version:

```bash
python --version
```

---

## Installation

```bash
# OpenAI support
pip install -e ".[openai]"

# Anthropic (Claude) support
pip install -e ".[anthropic]"

# Semantic similarity scoring (sentence-transformers)
pip install -e ".[semantic]"

# Everything
pip install -e ".[all]"
```

> The `[semantic]` extra installs `sentence-transformers` and `numpy`. The first time you use semantic scoring, the `all-MiniLM-L6-v2` model (~80 MB) is downloaded automatically.

---

## Quickstart

### 1. Initialise the project

```bash
llm-regtest init
```

Creates `.promptregtest/config.json` and a `prompt_cases.json` file.

### 2. Set your API key

```bash
# Mac / Linux
export OPENAI_API_KEY="sk-..."

# Windows PowerShell
$env:OPENAI_API_KEY="sk-..."
```

### 3. Define test cases

Edit `prompt_cases.json`:

```json
[
  {
    "id": "summarize-article",
    "prompt": "Summarize the following text in one sentence:",
    "input": "Scientists discovered that short 10-minute walks after meals reduce blood sugar spikes.",
    "baseline_output": "",
    "tags": ["summarization"]
  },
  {
    "id": "sentiment-label",
    "prompt": "Classify the sentiment. Reply with exactly one word: positive, negative, or neutral.",
    "input": "I absolutely loved the product!",
    "baseline_output": "",
    "tags": ["classification"]
  }
]
```

### 4. Generate baselines

```bash
llm-regtest update-baseline
```

This calls the AI for every case and saves the responses as baselines inside `prompt_cases.json`.

### 5. Run the regression tests

```bash
llm-regtest run
```

Sample output:

```
[PASS] summarize-article (fuzzy: 0.94, semantic: 0.97, agg=0.96) [312ms, $0.00018]
[WARN] email-rewrite (fuzzy: 0.71, semantic: 0.79, agg=0.77) [289ms, $0.00014]
[FAIL] sentiment-label (exact: 0.00, fuzzy: 0.43, agg=0.21) [198ms, $0.00009]

------------------------------------------------
Results (3 cases): 1 passed, 1 warned, 1 failed
Cost: $0.00041 | Latency: 266.3ms avg
------------------------------------------------
```

Each result line shows **latency** (ms) and **cost** (USD) alongside the score. The summary line aggregates total cost and average latency across the run.

### 6. View a saved report

```bash
llm-regtest report
```

---

## How the Regression Check Works

```
prompt_cases.json              .promptregtest/baselines/
┌──────────────────┐           ┌───────────────────────────┐
│ id: "summarize"  │           │ summarize.txt             │
│ prompt: "..."    │──run──▶   │ "Short 10-min walks..."   │◀── baseline
│ input: "..."     │           └───────────────────────────┘
└──────────────────┘                         │
                                             │ compare
New run output ──────────────────────────────┘
    "Brief post-meal walks..."
             │
             ▼
scorer.semantic_similarity(new, baseline)
  = cosine(embed(new), embed(baseline))
  = 0.91 → PASS (threshold: 0.85)
```

On each run, the tool:

1. Loads every case from `prompt_cases.json`
2. Sends the prompt (+ optional input) to the configured model
3. Measures **latency** (wall-clock ms) and **token counts** for that call
4. Computes **USD cost** from the provider's pricing table
5. Runs each configured scorer against the stored baseline
6. Computes a **weighted aggregate score**
7. Compares it to the `pass` / `warn` thresholds
8. Saves a JSON report to `.promptregtest/reports/`
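
Steps 6 and 7, the weighted aggregate and the threshold bucketing, can be sketched as follows. This is a minimal illustration; the function names and normalisation are assumptions, not the package's exact code:

```python
def aggregate(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-method scores; weights are normalised to sum to 1."""
    total = sum(weights.get(m, 1.0) for m in scores)
    return sum(s * weights.get(m, 1.0) for m, s in scores.items()) / total

def bucket(score: float, pass_at: float = 0.8, warn_at: float = 0.5) -> str:
    """Map an aggregate score to PASS / WARN / FAIL."""
    if score >= pass_at:
        return "PASS"
    if score >= warn_at:
        return "WARN"
    return "FAIL"

# Reproduces the [WARN] email-rewrite line from the sample output (agg=0.77).
agg = aggregate({"fuzzy": 0.71, "semantic": 0.79}, {"fuzzy": 0.3, "semantic": 0.7})
print(bucket(agg, pass_at=0.85, warn_at=0.65))  # WARN
```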

---

## Cost and Latency Tracking

Every run automatically captures:

| Metric | Where it appears |
|--------|-----------------|
| Per-request latency (ms) | Beside each result: `[312ms, $0.00018]` |
| Per-request USD cost | Beside each result: `[312ms, $0.00018]` |
| Total run cost | Summary line: `Cost: $0.00041` |
| Average latency | Summary line: `Latency: 266.3ms avg` |
| Input / output token counts | Saved in the JSON report |

Cost is calculated from built-in pricing tables for common OpenAI and Anthropic models. Unknown models report `$0.00000` rather than raising an error. The JSON report stores `input_tokens`, `output_tokens`, `cost_usd`, and `latency_ms` per case for later analysis.
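
Token-based cost accounting follows a simple pattern: per-million-token input and output rates multiplied by the token counts reported by the API. The rates below are illustrative placeholders, not the package's actual pricing table:

```python
# Hypothetical per-million-token USD rates (input, output), for illustration only.
PRICING = {
    "gpt-4o-mini": (0.15, 0.60),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """USD cost of one call; unknown models cost 0.0 instead of failing."""
    rate_in, rate_out = PRICING.get(model, (0.0, 0.0))
    return (input_tokens * rate_in + output_tokens * rate_out) / 1_000_000

print(f"${request_cost('gpt-4o-mini', 400, 100):.5f}")  # $0.00012
```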

---

## Semantic Similarity Scoring

The `semantic` scorer encodes both the new output and the baseline as sentence embeddings, then computes their **cosine similarity**. Two responses that express the same idea in different words score close to 1.0; responses with entirely different meaning score near 0.0.
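
Cosine similarity itself is just a normalised dot product. A dependency-free sketch over plain float vectors (the real scorer operates on `all-MiniLM-L6-v2` embeddings rather than raw lists):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Dot product of a and b divided by the product of their norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine([1.0, 0.0], [1.0, 0.0]))  # 1.0 (same direction)
print(cosine([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
```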

**Install:**

```bash
pip install -e ".[semantic]"
```

**Configure:**

```json
{
  "scoring": {
    "methods": ["fuzzy", "semantic"],
    "weights": { "fuzzy": 0.3, "semantic": 0.7 },
    "thresholds": { "pass": 0.85, "warn": 0.65 }
  }
}
```

The `all-MiniLM-L6-v2` model is downloaded once and cached in memory across cases in the same run. If `sentence-transformers` is not installed, the scorer is simply not registered — there is no crash on import. It fails only if your config explicitly requests `"semantic"` without the package installed.
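
This graceful degradation is the classic optional-dependency pattern: attempt the import at registration time and skip quietly on failure. A generic sketch, not the package's actual module layout:

```python
# Registry of available scorers; optional ones register only if importable.
SCORERS = {}

def register(name):
    def wrap(fn):
        SCORERS[name] = fn
        return fn
    return wrap

try:
    import sentence_transformers  # optional heavy dependency

    @register("semantic")
    def semantic_score(output: str, baseline: str) -> float:
        ...  # embed both strings and return their cosine similarity
except ImportError:
    # "semantic" is simply absent from the registry; requesting it in
    # config can then raise a clear "scorer not available" error.
    pass
```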

---

## Comparing Two Versions of a Prompt

The most common workflow: lock in version A, change the prompt, compare.

**1. Generate the baseline (version A)**

```bash
llm-regtest update-baseline
```

**2. Modify your prompt in `prompt_cases.json`**

```json
"prompt": "Write a one-sentence summary focusing on the key finding:"
```

**3. Run the comparison**

```bash
llm-regtest run --verbose
```

`--verbose` also prints a line-by-line diff between the baseline and the new output for each case.

**4. Compare two saved reports side-by-side**

```bash
llm-regtest compare \
  --report-a .promptregtest/reports/report_before.json \
  --report-b .promptregtest/reports/report_after.json
```

Output:

```
Case                score-a   score-b   delta    change
------------------------------------------------------
email-reply         0.73      0.95      +0.22    IMPROVED
summarize-article   0.91      0.84      -0.07    REGRESSED
sentiment-label     1.00      1.00      +0.00    unchanged

1 improved, 1 regressed, 1 unchanged
```

---

## End-to-End OpenAI Demo

A complete demo script is included that shows the full workflow with real OpenAI API calls — baseline generation, a clean regression run, drift simulation, and A/B comparison.

**Prerequisites:**

```bash
pip install "llm-regtest[openai,semantic]"
export OPENAI_API_KEY="sk-..."
```

**Run:**

```bash
python examples/demo_openai.py
```

The script runs four steps automatically:

| Step | What happens |
|------|-------------|
| 1/4 Generate baselines | Calls `gpt-4o-mini` for all 8 cases and saves responses |
| 2/4 Regression run | Reruns all cases — scores should be near 1.0 |
| 3/4 Simulate drift | Rewrites two prompt wordings to mimic a real prompt change |
| 4/4 Detect regression | Reruns with drifted prompts — expect WARN/FAIL on changed cases |
| Bonus A/B compare | Side-by-side delta table between the two runs |

The 8 demo cases cover summarization, tone rewriting, sentiment classification, and factual Q&A — a representative spread for testing how different task types respond to prompt drift.

---

## Configuration Reference

Default config lives at `.promptregtest/config.json`:

```json
{
  "model": {
    "provider": "openai",
    "model_name": "gpt-4o-mini",
    "temperature": 0.0,
    "max_tokens": 1024,
    "system_prompt": ""
  },
  "scoring": {
    "methods": ["fuzzy", "semantic"],
    "weights": { "fuzzy": 0.3, "semantic": 0.7 },
    "thresholds": {
      "pass": 0.85,
      "warn": 0.65
    },
    "llm_judge_model": null
  },
  "prompt_cases_path": "prompt_cases.json",
  "reports_dir": ".promptregtest/reports",
  "baselines_dir": ".promptregtest/baselines",
  "concurrency": 1
}
```

| Setting | What it does |
|---------|-------------|
| `provider` | `"openai"`, `"anthropic"`, or `"stub"` (no API key needed, for testing) |
| `model_name` | Model ID, e.g. `"gpt-4o-mini"`, `"claude-sonnet-4-6"` |
| `temperature` | Set to `0.0` for deterministic, repeatable outputs — strongly recommended for testing |
| `max_tokens` | Maximum response length |
| `system_prompt` | Global system-level instruction sent with every case |
| `methods` | Scorers to use: any combination of `"exact"`, `"fuzzy"`, `"semantic"`, `"llm_judge"` |
| `weights` | Per-method weights for the aggregate. Omit for equal weighting |
| `thresholds.pass` | Score at or above this → PASS (default: `0.8`) |
| `thresholds.warn` | Score at or above this → WARN (default: `0.5`); below → FAIL |
| `llm_judge_model` | Model config for the LLM-as-judge scorer (same shape as `model`) |
| `concurrency` | Cases to run in parallel (default: `1`) |

---

## Test Case Fields

```json
{
  "id": "my-test",
  "prompt": "Summarize in one sentence:",
  "prompt_file": "prompts/summarize.md",
  "system_prompt": "You are a concise assistant.",
  "system_prompt_file": "prompts/system/concise.md",
  "input": "Text to summarize goes here.",
  "inputs": ["Input A", "Input B", "Input C"],
  "inputs_file": "fixtures/reviews.json",
  "variables": { "name": "Alice", "role": "engineer" },
  "baseline_output": "",
  "tags": ["smoke", "summarization"]
}
```

| Field | Required | What it does |
|-------|----------|--------------|
| `id` | Yes | Unique identifier (no spaces) |
| `prompt` | Yes* | The instruction text |
| `prompt_file` | Yes* | Path to a `.txt` or `.md` file containing the prompt |
| `system_prompt` | No | Per-case system prompt (overrides global config) |
| `system_prompt_file` | No | Path to a file containing the system prompt |
| `input` | No | Extra text appended to the prompt |
| `inputs` | No | List of inputs — generates `id[0]`, `id[1]`, ... sub-cases |
| `inputs_file` | No | JSON file with a list of input strings |
| `variables` | No | Values for `{placeholder}` templates in the prompt |
| `baseline_output` | No | Auto-filled by `update-baseline` — leave blank |
| `tags` | No | Labels for filtering with `--tag` |

\*Exactly one of `prompt` or `prompt_file` is required, not both.
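
Variable substitution like this maps naturally onto Python's `str.format`. A minimal sketch of how `variables` could fill `{placeholder}` slots (the package's actual templating may differ):

```python
def render_prompt(template: str, variables: dict[str, str]) -> str:
    """Fill {placeholder} slots in a prompt template."""
    return template.format(**variables)

prompt = render_prompt(
    "Write a short bio for {name}, who works as an {role}.",
    {"name": "Alice", "role": "engineer"},
)
print(prompt)  # Write a short bio for Alice, who works as an engineer.
```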

---

## CLI Reference

| Command | What it does |
|---------|-------------|
| `llm-regtest init` | Create the project folder structure |
| `llm-regtest init --ci` | Also create a GitHub Actions workflow file |
| `llm-regtest update-baseline` | Run prompts and save responses as new baselines |
| `llm-regtest run` | Run prompts and compare to existing baselines |
| `llm-regtest report` | Display the most recent saved report |
| `llm-regtest compare` | Compare two saved reports side-by-side |

**Flags for `run` and `update-baseline`:**

| Flag | What it does |
|------|-------------|
| `--config PATH` | Use a non-default config file |
| `--case ID` | Run only this case ID (repeatable) |
| `--tag TAG` | Run only cases with this tag (repeatable, OR logic) |
| `--concurrency N` | Run N cases in parallel |

**Flags for `run` only:**

| Flag | What it does |
|------|-------------|
| `--verbose` / `-v` | Print a unified diff for each case |
| `--format console` | Default coloured output |
| `--format github` | GitHub Actions `::error::` / `::warning::` annotations |

---

## Advanced Features

### Semantic similarity

```json
"scoring": {
  "methods": ["fuzzy", "semantic"],
  "weights": { "fuzzy": 0.3, "semantic": 0.7 }
}
```

Requires `pip install -e ".[semantic]"`. Uses `all-MiniLM-L6-v2`.

### LLM-as-judge

```json
"scoring": {
  "methods": ["fuzzy", "llm_judge"],
  "weights": { "fuzzy": 0.4, "llm_judge": 0.6 },
  "llm_judge_model": {
    "provider": "openai",
    "model_name": "gpt-4o-mini"
  }
}
```

### Parallel execution

```bash
llm-regtest run --concurrency 10
```

### Tag filtering

```bash
llm-regtest run --tag smoke                    # fast per-PR smoke suite
llm-regtest run --tag smoke --tag critical     # OR logic
llm-regtest update-baseline --tag customer-facing
```

### Parameterized inputs

```json
{
  "id": "classify-sentiment",
  "prompt": "Classify as positive, negative, or neutral:",
  "inputs": [
    "Love it!",
    "Broke after one day.",
    "It's fine."
  ]
}
```

Generates sub-cases `classify-sentiment[0]`, `classify-sentiment[1]`, `classify-sentiment[2]`.
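
The expansion from one parameterized case into indexed sub-cases can be sketched like this; the field names follow the README, but the internal representation is an assumption:

```python
def expand_case(case: dict) -> list[dict]:
    """Turn a case with an `inputs` list into one indexed sub-case per input."""
    inputs = case.get("inputs")
    if not inputs:
        return [case]
    base = {k: v for k, v in case.items() if k != "inputs"}
    return [
        {**base, "id": f"{case['id']}[{i}]", "input": text}
        for i, text in enumerate(inputs)
    ]

subs = expand_case({
    "id": "classify-sentiment",
    "prompt": "Classify as positive, negative, or neutral:",
    "inputs": ["Love it!", "Broke after one day.", "It's fine."],
})
print([c["id"] for c in subs])
# ['classify-sentiment[0]', 'classify-sentiment[1]', 'classify-sentiment[2]']
```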

### Prompt files

```json
{
  "id": "legal-analysis",
  "prompt_file": "prompts/analyze_contract.md",
  "system_prompt_file": "prompts/system/legal_analyst.md",
  "input": "..."
}
```

Paths are relative to the directory containing `prompt_cases.json`. Prompt files appear as clean diffs in pull requests.

### Custom scorers

```python
from llm_regtest.scorer import register_scorer

def keyword_overlap(output: str, baseline: str) -> float:
    out_words = set(output.lower().split())
    base_words = set(baseline.lower().split())
    if not base_words:
        return 1.0
    return len(out_words & base_words) / len(base_words)

register_scorer("keyword_overlap", keyword_overlap)
```

Then add `"keyword_overlap"` to `methods` in your config.

---

## CI / GitHub Actions

### Generate a workflow file

```bash
llm-regtest init --ci
```

Creates `.github/workflows/prompt-regression.yml` — runs on every PR that touches prompt files.

### Annotated PR output

```bash
llm-regtest run --format github
```

Emits `::error::` and `::warning::` lines that GitHub renders as inline annotations on the PR diff.
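
GitHub's workflow commands are plain lines on stdout, so a `github` formatter only needs to prefix each result appropriately. A minimal sketch using the documented `::error::` and `::warning::` command syntax (the message layout here is illustrative, not the tool's exact output):

```python
def annotate(status: str, case_id: str, score: float) -> str:
    """Format one result as a GitHub Actions workflow command where needed."""
    if status == "FAIL":
        return f"::error::{case_id} regressed (aggregate score {score:.2f})"
    if status == "WARN":
        return f"::warning::{case_id} drifted (aggregate score {score:.2f})"
    return f"[PASS] {case_id} ({score:.2f})"

print(annotate("FAIL", "sentiment-label", 0.21))
# ::error::sentiment-label regressed (aggregate score 0.21)
```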

### Example workflow

```yaml
name: Prompt Regression Tests
on:
  pull_request:
    paths:
      - 'prompts/**'
      - 'prompt_cases.json'
jobs:
  regression:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - run: pip install -e ".[openai,semantic]"
      - run: llm-regtest run --format github
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
```

Add `OPENAI_API_KEY` as a repository secret under **Settings → Secrets → Actions**.

---

## Using Claude (Anthropic) as the Model

```json
{
  "model": {
    "provider": "anthropic",
    "model_name": "claude-haiku-4-5-20251001",
    "temperature": 0.0,
    "max_tokens": 1024
  }
}
```

```bash
pip install -e ".[anthropic]"
export ANTHROPIC_API_KEY="sk-ant-..."
```

Available model names: `claude-haiku-4-5-20251001`, `claude-sonnet-4-6`, `claude-opus-4-6`.

---

## Understanding the Results

| Status | Score range | Meaning |
|--------|------------|---------|
| **PASS** | ≥ `thresholds.pass` | Response is very similar to baseline |
| **WARN** | ≥ `thresholds.warn` | Response has drifted noticeably — worth reviewing |
| **FAIL** | < `thresholds.warn` | Significant regression detected |
| **SKIP** | — | No baseline exists for this case yet |

Default thresholds: `pass = 0.8`, `warn = 0.5`. Adjust in `config.json` under `scoring.thresholds`.

---

## Troubleshooting

**"No module named llm_regtest"**
Run `pip install -e .` from the project root directory.

**"OPENAI_API_KEY not set" or authentication errors**
Set the environment variable in the same terminal window you're running the tool from.

**All tests show "skip"**
No baselines yet. Run `llm-regtest update-baseline` first.

**Semantic scorer not available**
Install with `pip install -e ".[semantic]"`. The scorer is silently skipped if `sentence-transformers` is not installed.

**Scores are lower than expected after a small change**
Switch from `exact` to `fuzzy` or `semantic` scoring, which are more tolerant of minor wording differences.

**Runs are slow with many cases**
Use `--concurrency N` to run cases in parallel. Start with `5` and increase if your API rate limits allow.