llm-regtest 0.1.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (30)
  1. llm_regtest-0.1.0/PKG-INFO +620 -0
  2. llm_regtest-0.1.0/README.md +593 -0
  3. llm_regtest-0.1.0/pyproject.toml +30 -0
  4. llm_regtest-0.1.0/setup.cfg +4 -0
  5. llm_regtest-0.1.0/src/llm_regtest/__init__.py +3 -0
  6. llm_regtest-0.1.0/src/llm_regtest/__main__.py +5 -0
  7. llm_regtest-0.1.0/src/llm_regtest/analyzer.py +136 -0
  8. llm_regtest-0.1.0/src/llm_regtest/cli.py +421 -0
  9. llm_regtest-0.1.0/src/llm_regtest/config.py +136 -0
  10. llm_regtest-0.1.0/src/llm_regtest/loader.py +132 -0
  11. llm_regtest-0.1.0/src/llm_regtest/models/__init__.py +47 -0
  12. llm_regtest-0.1.0/src/llm_regtest/models/anthropic_model.py +85 -0
  13. llm_regtest-0.1.0/src/llm_regtest/models/base.py +53 -0
  14. llm_regtest-0.1.0/src/llm_regtest/models/openai_model.py +90 -0
  15. llm_regtest-0.1.0/src/llm_regtest/reporter.py +244 -0
  16. llm_regtest-0.1.0/src/llm_regtest/runner.py +125 -0
  17. llm_regtest-0.1.0/src/llm_regtest/scorer.py +189 -0
  18. llm_regtest-0.1.0/src/llm_regtest/tests/__init__.py +0 -0
  19. llm_regtest-0.1.0/src/llm_regtest/tests/conftest.py +57 -0
  20. llm_regtest-0.1.0/src/llm_regtest/tests/test_analyzer.py +98 -0
  21. llm_regtest-0.1.0/src/llm_regtest/tests/test_cli.py +83 -0
  22. llm_regtest-0.1.0/src/llm_regtest/tests/test_loader.py +83 -0
  23. llm_regtest-0.1.0/src/llm_regtest/tests/test_scorer.py +117 -0
  24. llm_regtest-0.1.0/src/llm_regtest/types.py +100 -0
  25. llm_regtest-0.1.0/src/llm_regtest.egg-info/PKG-INFO +620 -0
  26. llm_regtest-0.1.0/src/llm_regtest.egg-info/SOURCES.txt +28 -0
  27. llm_regtest-0.1.0/src/llm_regtest.egg-info/dependency_links.txt +1 -0
  28. llm_regtest-0.1.0/src/llm_regtest.egg-info/entry_points.txt +2 -0
  29. llm_regtest-0.1.0/src/llm_regtest.egg-info/requires.txt +24 -0
  30. llm_regtest-0.1.0/src/llm_regtest.egg-info/top_level.txt +1 -0
@@ -0,0 +1,620 @@
Metadata-Version: 2.4
Name: llm-regtest
Version: 0.1.0
Summary: Lightweight regression testing for LLM prompts
Author: Libin Samkutty
License: MIT
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Provides-Extra: openai
Requires-Dist: openai>=1.0; extra == "openai"
Provides-Extra: anthropic
Requires-Dist: anthropic>=0.18; extra == "anthropic"
Provides-Extra: yaml
Requires-Dist: pyyaml>=6.0; extra == "yaml"
Provides-Extra: semantic
Requires-Dist: sentence-transformers>=2.2; extra == "semantic"
Requires-Dist: numpy>=1.24; extra == "semantic"
Provides-Extra: all
Requires-Dist: openai>=1.0; extra == "all"
Requires-Dist: anthropic>=0.18; extra == "all"
Requires-Dist: pyyaml>=6.0; extra == "all"
Requires-Dist: sentence-transformers>=2.2; extra == "all"
Requires-Dist: numpy>=1.24; extra == "all"
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0; extra == "dev"

# llm-regtest

A lightweight regression testing framework for LLM prompts. Catch semantic drift before it ships.

---

## What does this tool do?

When you use an AI model, the exact wording of your instructions — called a **prompt** — has a big effect on the quality of responses. Small wording changes can cause unexpected regressions that are very hard to spot without a systematic check.

This tool works like a test suite for prompts:

1. You define test cases with prompts and optional inputs
2. Run `llm-regtest update-baseline` — the AI's responses become your **golden dataset**
3. Later, after changing a prompt or upgrading a model, run `llm-regtest run`
4. The tool compares new responses to the baselines and flags anything that regressed

Each comparison produces a **score from 0.0 to 1.0** using one or more methods:

| Method | How it works | Best for |
|--------|-------------|----------|
| `exact` | Character-for-character match (0 or 1) | Classification labels, structured outputs |
| `fuzzy` | Levenshtein edit-distance ratio | General text where minor wording drift is acceptable |
| `semantic` | Cosine similarity of sentence embeddings (`all-MiniLM-L6-v2`) | Longer outputs where meaning matters more than wording |
| `llm_judge` | An LLM rates quality similarity 0.0–1.0 | Conversational outputs where both fuzzy and semantic are too strict |

Results are bucketed into **PASS / WARN / FAIL** based on configurable score thresholds.
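
The scorer implementations live in `scorer.py` and aren't shown here, but the `fuzzy` method plus the threshold bucketing can be sketched in a few lines (a sketch only: it uses `difflib`'s similarity ratio as a stand-in for a true Levenshtein ratio, with the documented default thresholds of 0.8 / 0.5):

```python
from difflib import SequenceMatcher

def fuzzy_score(output: str, baseline: str) -> float:
    # Similarity ratio in [0, 1]; difflib's ratio stands in here for a
    # Levenshtein edit-distance ratio.
    return SequenceMatcher(None, output, baseline).ratio()

def bucket(score: float, pass_t: float = 0.8, warn_t: float = 0.5) -> str:
    # Scores at or above pass_t pass; at or above warn_t warn; otherwise fail.
    if score >= pass_t:
        return "PASS"
    if score >= warn_t:
        return "WARN"
    return "FAIL"

print(bucket(fuzzy_score("The cat sat on the mat.", "The cat sat on a mat.")))
```

The other methods plug into the same bucketing: `exact` is plain string equality, `semantic` is embedding cosine similarity.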

---

## What you need before starting

- **Python 3.10 or newer** ([download here](https://www.python.org/downloads/))
- **pip** — comes with Python automatically
- **An API key** — for OpenAI ([get one here](https://platform.openai.com/api-keys)) or Anthropic ([get one here](https://console.anthropic.com/))

To check your Python version:

```bash
python --version
```

---

## Installation

```bash
# OpenAI support
pip install "llm-regtest[openai]"

# Anthropic (Claude) support
pip install "llm-regtest[anthropic]"

# Semantic similarity scoring (sentence-transformers)
pip install "llm-regtest[semantic]"

# Everything
pip install "llm-regtest[all]"
```

If you are working from a source checkout instead, use an editable install: `pip install -e ".[all]"`.

> The `[semantic]` extra installs `sentence-transformers` and `numpy`. The first time you use semantic scoring, the `all-MiniLM-L6-v2` model (~80 MB) is downloaded automatically.

---

## Quickstart

### 1. Initialise the project

```bash
llm-regtest init
```

Creates `.promptregtest/config.json` and a `prompt_cases.json` file.

### 2. Set your API key

```bash
# Mac / Linux
export OPENAI_API_KEY="sk-..."

# Windows PowerShell
$env:OPENAI_API_KEY="sk-..."
```

### 3. Define test cases

Edit `prompt_cases.json`:

```json
[
  {
    "id": "summarize-article",
    "prompt": "Summarize the following text in one sentence:",
    "input": "Scientists discovered that short 10-minute walks after meals reduce blood sugar spikes.",
    "baseline_output": "",
    "tags": ["summarization"]
  },
  {
    "id": "sentiment-label",
    "prompt": "Classify the sentiment. Reply with exactly one word: positive, negative, or neutral.",
    "input": "I absolutely loved the product!",
    "baseline_output": "",
    "tags": ["classification"]
  }
]
```

### 4. Generate baselines

```bash
llm-regtest update-baseline
```

This calls the AI for every case and saves the responses as baselines inside `prompt_cases.json`.

### 5. Run the regression tests

```bash
llm-regtest run
```

Sample output:

```
[PASS] summarize-article (fuzzy: 0.94, semantic: 0.97, agg=0.96) [312ms, $0.00018]
[WARN] email-rewrite (fuzzy: 0.71, semantic: 0.79, agg=0.77) [289ms, $0.00014]
[FAIL] sentiment-label (exact: 0.00, fuzzy: 0.43, agg=0.21) [198ms, $0.00009]

------------------------------------------------
Results (3 cases): 1 passed, 1 warned, 1 failed
Cost: $0.00041 | Latency: 266.3ms avg
------------------------------------------------
```

Each result line shows **latency** (ms) and **cost** (USD) alongside the score. The summary line aggregates total cost and average latency across the run.

### 6. View a saved report

```bash
llm-regtest report
```

---

## How the Regression Check Works

```
prompt_cases.json                .promptregtest/baselines/
┌──────────────────┐             ┌───────────────────────────┐
│ id: "summarize"  │             │ summarize.txt             │
│ prompt: "..."    │ ──run──▶    │ "Short 10-min walks..."   │ ◀── baseline
│ input: "..."     │             └───────────────────────────┘
└──────────────────┘                           │
                                               │ compare
New run output ────────────────────────────────┘
"Brief post-meal walks..."

scorer.semantic_similarity(new, baseline)
  = cosine(embed(new), embed(baseline))
  = 0.91 → PASS (threshold: 0.85)
```

On each run, the tool:

1. Loads every case from `prompt_cases.json`
2. Sends the prompt (+ optional input) to the configured model
3. Measures **latency** (wall-clock ms) and **token counts** for that call
4. Computes **USD cost** from the provider's pricing table
5. Runs each configured scorer against the stored baseline
6. Computes a **weighted aggregate score**
7. Compares it to the `pass` / `warn` thresholds
8. Saves a JSON report to `.promptregtest/reports/`

---

## Cost and Latency Tracking

Every run automatically captures:

| Metric | Where it appears |
|--------|-----------------|
| Per-request latency (ms) | Beside each result: `[312ms, $0.00018]` |
| Per-request USD cost | Beside each result: `[312ms, $0.00018]` |
| Total run cost | Summary line: `Cost: $0.00041` |
| Average latency | Summary line: `Latency: 266.3ms avg` |
| Input / output token counts | Saved in the JSON report |

Cost is calculated from built-in pricing tables for common OpenAI and Anthropic models. Unknown models report `$0.00000` rather than raising an error. The JSON report stores `input_tokens`, `output_tokens`, `cost_usd`, and `latency_ms` per case for later analysis.
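
Cost from token counts is a lookup and a multiply. A sketch with an illustrative per-million-token price entry (the built-in pricing table itself isn't shown in this README, so treat the numbers as an assumption):

```python
# Illustrative pricing entry: USD per 1M tokens as (input, output).
PRICES = {"gpt-4o-mini": (0.15, 0.60)}

def cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    # Unknown models cost $0.0 instead of raising, matching the behaviour above.
    inp, out = PRICES.get(model, (0.0, 0.0))
    return (input_tokens * inp + output_tokens * out) / 1_000_000

print(f"${cost_usd('gpt-4o-mini', 400, 200):.5f}")
```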

---

## Semantic Similarity Scoring

The `semantic` scorer encodes both the new output and the baseline as sentence embeddings, then computes their **cosine similarity**. Two responses that express the same idea in different words score close to 1.0; responses with entirely different meaning score near 0.0.

**Install:**

```bash
pip install "llm-regtest[semantic]"
```

**Configure:**

```json
{
  "scoring": {
    "methods": ["fuzzy", "semantic"],
    "weights": { "fuzzy": 0.3, "semantic": 0.7 },
    "thresholds": { "pass": 0.85, "warn": 0.65 }
  }
}
```

The `all-MiniLM-L6-v2` model is downloaded once and cached in memory across cases in the same run. If `sentence-transformers` is not installed, the scorer is simply not registered, so there is no crash on import. It only fails if you explicitly request `"semantic"` in your config without the package installed.
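
Under the hood, cosine similarity of two embedding vectors is just their normalized dot product. A dependency-free sketch of the math:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    # dot(a, b) / (|a| * |b|): 1.0 for same direction, 0.0 for orthogonal.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine([1.0, 2.0], [2.0, 4.0]))  # ~1.0 (same direction)
print(cosine([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
```

In the real scorer the vectors are the 384-dimensional sentence embeddings produced by `all-MiniLM-L6-v2` rather than these toy 2-vectors.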

---

## Comparing Two Versions of a Prompt

The most common workflow: lock in version A, change the prompt, compare.

**1. Generate the baseline (version A)**

```bash
llm-regtest update-baseline
```

**2. Modify your prompt in `prompt_cases.json`**

```json
"prompt": "Write a one-sentence summary focusing on the key finding:"
```

**3. Run the comparison**

```bash
llm-regtest run --verbose
```

`--verbose` also prints a line-by-line diff between the baseline and the new output for each case.

**4. Compare two saved reports side-by-side**

```bash
llm-regtest compare \
  --report-a .promptregtest/reports/report_before.json \
  --report-b .promptregtest/reports/report_after.json
```

Output:

```
Case               score-a   score-b   delta    change
------------------------------------------------------
email-reply        0.73      0.95      +0.22    IMPROVED
summarize-article  0.91      0.84      -0.07    REGRESSED
sentiment-label    1.00      1.00      +0.00    unchanged

1 improved, 1 regressed, 1 unchanged
```
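
The same delta computation is easy to reproduce yourself from saved reports. A sketch assuming a hypothetical report shape of `{"cases": [{"id": ..., "aggregate_score": ...}]}` (the actual report schema may differ; load each file with `json.load` before calling):

```python
def score_deltas(report_a: dict, report_b: dict) -> dict[str, float]:
    # Map each case id to its aggregate score, then diff cases present in both.
    a = {c["id"]: c["aggregate_score"] for c in report_a["cases"]}
    b = {c["id"]: c["aggregate_score"] for c in report_b["cases"]}
    return {cid: round(b[cid] - a[cid], 2) for cid in a.keys() & b.keys()}

before = {"cases": [{"id": "email-reply", "aggregate_score": 0.73}]}
after = {"cases": [{"id": "email-reply", "aggregate_score": 0.95}]}
print(score_deltas(before, after))  # {'email-reply': 0.22}
```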

---

## End-to-End OpenAI Demo

A complete demo script is included that shows the full workflow with real OpenAI API calls — including baseline generation, a clean regression run, drift simulation, and A/B comparison.

**Prerequisites:**

```bash
pip install "llm-regtest[openai,semantic]"
export OPENAI_API_KEY="sk-..."
```

**Run:**

```bash
python examples/demo_openai.py
```

The script runs four steps automatically:

| Step | What happens |
|------|-------------|
| 1/4 Generate baselines | Calls `gpt-4o-mini` for all 8 cases and saves responses |
| 2/4 Regression run | Reruns all cases — scores should be near 1.0 |
| 3/4 Simulate drift | Rewrites two prompt wordings to mimic a real prompt change |
| 4/4 Detect regression | Reruns with drifted prompts — expect WARN/FAIL on changed cases |
| Bonus A/B compare | Side-by-side delta table between the two runs |

The 8 demo cases cover summarization, tone rewriting, sentiment classification, and factual Q&A — a representative spread for testing how different task types respond to prompt drift.

---

## Configuration Reference

The config file lives at `.promptregtest/config.json`. Example:

```json
{
  "model": {
    "provider": "openai",
    "model_name": "gpt-4o-mini",
    "temperature": 0.0,
    "max_tokens": 1024,
    "system_prompt": ""
  },
  "scoring": {
    "methods": ["fuzzy", "semantic"],
    "weights": { "fuzzy": 0.3, "semantic": 0.7 },
    "thresholds": {
      "pass": 0.85,
      "warn": 0.65
    },
    "llm_judge_model": null
  },
  "prompt_cases_path": "prompt_cases.json",
  "reports_dir": ".promptregtest/reports",
  "baselines_dir": ".promptregtest/baselines",
  "concurrency": 1
}
```

| Setting | What it does |
|---------|-------------|
| `provider` | `"openai"`, `"anthropic"`, or `"stub"` (no API key needed, for testing) |
| `model_name` | Model ID, e.g. `"gpt-4o-mini"`, `"claude-sonnet-4-6"` |
| `temperature` | Set to `0.0` for deterministic, repeatable outputs — strongly recommended for testing |
| `max_tokens` | Maximum response length |
| `system_prompt` | Global system-level instruction sent with every case |
| `methods` | Scorers to use: any combination of `"exact"`, `"fuzzy"`, `"semantic"`, `"llm_judge"` |
| `weights` | Per-method weights for the aggregate. Omit for equal weighting |
| `thresholds.pass` | Score at or above this → PASS (default: `0.8`) |
| `thresholds.warn` | Score at or above this → WARN (default: `0.5`); below → FAIL |
| `llm_judge_model` | Model config for the LLM-as-judge scorer (same shape as `model`) |
| `concurrency` | Cases to run in parallel (default: `1`) |

---

## Test Case Fields

```json
{
  "id": "my-test",
  "prompt": "Summarize in one sentence:",
  "prompt_file": "prompts/summarize.md",
  "system_prompt": "You are a concise assistant.",
  "system_prompt_file": "prompts/system/concise.md",
  "input": "Text to summarize goes here.",
  "inputs": ["Input A", "Input B", "Input C"],
  "inputs_file": "fixtures/reviews.json",
  "variables": { "name": "Alice", "role": "engineer" },
  "baseline_output": "",
  "tags": ["smoke", "summarization"]
}
```

| Field | Required | What it does |
|-------|----------|--------------|
| `id` | Yes | Unique identifier (no spaces) |
| `prompt` | Yes* | The instruction text |
| `prompt_file` | Yes* | Path to a `.txt` or `.md` file containing the prompt |
| `system_prompt` | No | Per-case system prompt (overrides global config) |
| `system_prompt_file` | No | Path to a file containing the system prompt |
| `input` | No | Extra text appended to the prompt |
| `inputs` | No | List of inputs — generates `id[0]`, `id[1]`, ... sub-cases |
| `inputs_file` | No | JSON file with a list of input strings |
| `variables` | No | Values for `{placeholder}` templates in the prompt |
| `baseline_output` | No | Auto-filled by `update-baseline` — leave blank |
| `tags` | No | Labels for filtering with `--tag` |

*Exactly one of `prompt` or `prompt_file` is required, never both.
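
For illustration, the `variables` field fills `{placeholder}` templates in the prompt; this assumes Python `str.format`-style substitution (a hypothetical example with made-up values, not the loader's actual code):

```python
# A prompt template and a `variables` mapping, as in a test case above.
prompt = "Write a short bio for {name} ({role})."
variables = {"name": "Alice", "role": "engineer"}

rendered = prompt.format(**variables)
print(rendered)  # Write a short bio for Alice (engineer).
```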

---

## CLI Reference

| Command | What it does |
|---------|-------------|
| `llm-regtest init` | Create the project folder structure |
| `llm-regtest init --ci` | Also create a GitHub Actions workflow file |
| `llm-regtest update-baseline` | Run prompts and save responses as new baselines |
| `llm-regtest run` | Run prompts and compare to existing baselines |
| `llm-regtest report` | Display the most recent saved report |
| `llm-regtest compare` | Compare two saved reports side-by-side |

**Flags for `run` and `update-baseline`:**

| Flag | What it does |
|------|-------------|
| `--config PATH` | Use a non-default config file |
| `--case ID` | Run only this case ID (repeatable) |
| `--tag TAG` | Run only cases with this tag (repeatable, OR logic) |
| `--concurrency N` | Run N cases in parallel |

**Flags for `run` only:**

| Flag | What it does |
|------|-------------|
| `--verbose` / `-v` | Print a unified diff for each case |
| `--format console` | Default coloured output |
| `--format github` | GitHub Actions `::error::` / `::warning::` annotations |

---

## Advanced Features

### Semantic similarity

```json
"scoring": {
  "methods": ["fuzzy", "semantic"],
  "weights": { "fuzzy": 0.3, "semantic": 0.7 }
}
```

Requires `pip install "llm-regtest[semantic]"`. Uses `all-MiniLM-L6-v2`.

### LLM-as-judge

```json
"scoring": {
  "methods": ["fuzzy", "llm_judge"],
  "weights": { "fuzzy": 0.4, "llm_judge": 0.6 },
  "llm_judge_model": {
    "provider": "openai",
    "model_name": "gpt-4o-mini"
  }
}
```

### Parallel execution

```bash
llm-regtest run --concurrency 10
```

### Tag filtering

```bash
llm-regtest run --tag smoke                     # fast per-PR smoke suite
llm-regtest run --tag smoke --tag critical      # OR logic
llm-regtest update-baseline --tag customer-facing
```

### Parameterized inputs

```json
{
  "id": "classify-sentiment",
  "prompt": "Classify as positive, negative, or neutral:",
  "inputs": [
    "Love it!",
    "Broke after one day.",
    "It's fine."
  ]
}
```

Generates sub-cases `classify-sentiment[0]`, `classify-sentiment[1]`, `classify-sentiment[2]`.
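
The expansion behaviour described above can be sketched as follows (a sketch of the documented behaviour, not the loader's actual code):

```python
def expand(case: dict) -> list[dict]:
    # A case without an `inputs` list runs as-is.
    if not case.get("inputs"):
        return [case]
    # Otherwise, emit one sub-case per input, with ids suffixed [0], [1], ...
    subs = []
    for i, inp in enumerate(case["inputs"]):
        sub = {k: v for k, v in case.items() if k != "inputs"}
        sub["id"] = f"{case['id']}[{i}]"
        sub["input"] = inp
        subs.append(sub)
    return subs

cases = expand({
    "id": "classify-sentiment",
    "prompt": "Classify as positive, negative, or neutral:",
    "inputs": ["Love it!", "Broke after one day.", "It's fine."],
})
print([c["id"] for c in cases])
# ['classify-sentiment[0]', 'classify-sentiment[1]', 'classify-sentiment[2]']
```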

### Prompt files

```json
{
  "id": "legal-analysis",
  "prompt_file": "prompts/analyze_contract.md",
  "system_prompt_file": "prompts/system/legal_analyst.md",
  "input": "..."
}
```

Paths are relative to the directory containing `prompt_cases.json`. Prompt files appear as clean diffs in pull requests.

### Custom scorers

```python
from llm_regtest.scorer import register_scorer

def keyword_overlap(output: str, baseline: str) -> float:
    # Fraction of the baseline's words that also appear in the new output.
    out_words = set(output.lower().split())
    base_words = set(baseline.lower().split())
    if not base_words:
        return 1.0
    return len(out_words & base_words) / len(base_words)

register_scorer("keyword_overlap", keyword_overlap)
```

Then add `"keyword_overlap"` to `methods` in your config.

---

## CI / GitHub Actions

### Generate a workflow file

```bash
llm-regtest init --ci
```

Creates `.github/workflows/prompt-regression.yml` — runs on every PR that touches prompt files.

### Annotated PR output

```bash
llm-regtest run --format github
```

Emits `::error::` and `::warning::` lines that GitHub renders as inline annotations on the PR diff.

### Example workflow

```yaml
name: Prompt Regression Tests
on:
  pull_request:
    paths:
      - 'prompts/**'
      - 'prompt_cases.json'
jobs:
  regression:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - run: pip install "llm-regtest[openai,semantic]"
      - run: llm-regtest run --format github
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
```

Add `OPENAI_API_KEY` as a repository secret under **Settings → Secrets and variables → Actions**.

---

## Using Claude (Anthropic) as the Model

```json
{
  "model": {
    "provider": "anthropic",
    "model_name": "claude-haiku-4-5-20251001",
    "temperature": 0.0,
    "max_tokens": 1024
  }
}
```

```bash
pip install "llm-regtest[anthropic]"
export ANTHROPIC_API_KEY="sk-ant-..."
```

Available model names include `claude-haiku-4-5-20251001`, `claude-sonnet-4-6`, and `claude-opus-4-6`.

---

## Understanding the Results

| Status | Score range | Meaning |
|--------|------------|---------|
| **PASS** | ≥ `thresholds.pass` | Response is very similar to baseline |
| **WARN** | ≥ `thresholds.warn` | Response has drifted noticeably — worth reviewing |
| **FAIL** | < `thresholds.warn` | Significant regression detected |
| **SKIP** | — | No baseline exists for this case yet |

Default thresholds: `pass = 0.8`, `warn = 0.5`. Adjust in `config.json` under `scoring.thresholds`.

---

## Troubleshooting

**"No module named llm_regtest"**
Run `pip install llm-regtest` (or `pip install -e .` from the project root if you're working from a source checkout).

**"OPENAI_API_KEY not set" or authentication errors**
Set the environment variable in the same terminal window you're running the tool from.

**All tests show "skip"**
No baselines exist yet. Run `llm-regtest update-baseline` first.

**Semantic scorer not available**
Install with `pip install "llm-regtest[semantic]"`. The scorer is silently skipped if `sentence-transformers` is not installed.

**Scores are lower than expected after a small change**
Switch from `exact` to `fuzzy` or `semantic` scoring, which are more tolerant of minor wording differences.

**Runs are slow with many cases**
Use `--concurrency N` to run cases in parallel. Start with `5` and increase if your API rate limits allow.