llm-judge-kit 0.1.0__tar.gz

Files changed (32)
  1. llm_judge_kit-0.1.0/.gitignore +34 -0
  2. llm_judge_kit-0.1.0/CHANGELOG.md +75 -0
  3. llm_judge_kit-0.1.0/LICENSE +21 -0
  4. llm_judge_kit-0.1.0/PKG-INFO +266 -0
  5. llm_judge_kit-0.1.0/README.md +232 -0
  6. llm_judge_kit-0.1.0/pyproject.toml +141 -0
  7. llm_judge_kit-0.1.0/src/llm_judge_kit/__init__.py +98 -0
  8. llm_judge_kit-0.1.0/src/llm_judge_kit/_logging.py +51 -0
  9. llm_judge_kit-0.1.0/src/llm_judge_kit/_version.py +9 -0
  10. llm_judge_kit-0.1.0/src/llm_judge_kit/benchmark.py +91 -0
  11. llm_judge_kit-0.1.0/src/llm_judge_kit/caching.py +63 -0
  12. llm_judge_kit-0.1.0/src/llm_judge_kit/cli.py +153 -0
  13. llm_judge_kit-0.1.0/src/llm_judge_kit/consensus.py +123 -0
  14. llm_judge_kit-0.1.0/src/llm_judge_kit/dataset.py +110 -0
  15. llm_judge_kit-0.1.0/src/llm_judge_kit/errors.py +43 -0
  16. llm_judge_kit-0.1.0/src/llm_judge_kit/judge.py +192 -0
  17. llm_judge_kit-0.1.0/src/llm_judge_kit/parsing.py +127 -0
  18. llm_judge_kit-0.1.0/src/llm_judge_kit/providers/__init__.py +40 -0
  19. llm_judge_kit-0.1.0/src/llm_judge_kit/providers/anthropic.py +84 -0
  20. llm_judge_kit-0.1.0/src/llm_judge_kit/providers/base.py +33 -0
  21. llm_judge_kit-0.1.0/src/llm_judge_kit/providers/mock.py +81 -0
  22. llm_judge_kit-0.1.0/src/llm_judge_kit/providers/ollama.py +81 -0
  23. llm_judge_kit-0.1.0/src/llm_judge_kit/providers/openai.py +81 -0
  24. llm_judge_kit-0.1.0/src/llm_judge_kit/providers/registry.py +67 -0
  25. llm_judge_kit-0.1.0/src/llm_judge_kit/py.typed +0 -0
  26. llm_judge_kit-0.1.0/src/llm_judge_kit/reliability.py +100 -0
  27. llm_judge_kit-0.1.0/src/llm_judge_kit/reporting.py +197 -0
  28. llm_judge_kit-0.1.0/src/llm_judge_kit/rubrics/__init__.py +33 -0
  29. llm_judge_kit-0.1.0/src/llm_judge_kit/rubrics/base.py +118 -0
  30. llm_judge_kit-0.1.0/src/llm_judge_kit/rubrics/builtin.py +77 -0
  31. llm_judge_kit-0.1.0/src/llm_judge_kit/types.py +91 -0
  32. llm_judge_kit-0.1.0/src/pytest_llm_judge_kit.py +118 -0
@@ -0,0 +1,34 @@
1
+ # Python
2
+ __pycache__/
3
+ *.py[cod]
4
+ *.egg-info/
5
+ .eggs/
6
+ build/
7
+ dist/
8
+ *.whl
9
+
10
+ # Environments / secrets
11
+ .venv/
12
+ venv/
13
+ .env
14
+ .env.*
15
+ !.env.example
16
+
17
+ # Tooling caches
18
+ .pytest_cache/
19
+ .mypy_cache/
20
+ .ruff_cache/
21
+ .coverage
22
+ .coverage.*
23
+ htmlcov/
24
+ coverage.xml
25
+
26
+ # OS / editors
27
+ .DS_Store
28
+ .idea/
29
+ .vscode/
30
+
31
+ # Local-only working notes (not part of the public project)
32
+ CLAUDE.md
33
+ PROMPT.md
34
+ docs/strategy.md
@@ -0,0 +1,75 @@
1
+ # Changelog
2
+
3
+ All notable changes to this project are documented here.
4
+ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
5
+ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
6
+
7
+ ## [Unreleased]
8
+
9
+ ### Added
10
+ - Project scaffold: `pyproject.toml`, tooling gate (ruff + mypy strict + pytest
11
+ with ≥95% coverage), CI workflow, MIT license, contributor docs.
12
+ - **Wedge core (M1):**
13
+ - `Judge` — pairs a provider with a rubric; `score()` returns a typed
14
+ `JudgeResult`; `__call__` alias and `passed()` convenience.
15
+ - `JudgeResult` / `ProviderResponse` — frozen, typed; `JudgeResult` casts
16
+ (`__float__`) and compares (`<`, `<=`, `>`, `>=`) like its score.
17
+ - `Provider` protocol + `BaseProvider`; deterministic offline `MockProvider`.
18
+ - Provider registry with `"scheme:model"` spec parsing (`register_provider`,
19
+ `make_provider`, `available_providers`).
20
+ - `Rubric` + registry and five built-ins: factuality, groundedness,
21
+ relevance, instruction_following, safety (`register_rubric`, `get_rubric`,
22
+ `available_rubrics`).
23
+ - Robust `extract_json` parser (markdown fences, surrounding prose, trailing
24
+ commas, nested braces) with a clear `ParseError`.
25
+ - Typed exception hierarchy (`LLMJudgeError` and subclasses).
26
+ - Runnable examples and a README quickstart that executes offline.
27
+
28
+ - **Real providers (M2):**
29
+ - `OpenAIProvider` (Chat Completions; works with any OpenAI-compatible
30
+ `base_url`), `AnthropicProvider` (Messages API), `OllamaProvider` (local,
31
+ via httpx). Registered under the `openai:`/`anthropic:`/`ollama:` schemes.
32
+ - SDKs are imported lazily — the core stays dependency-free; the import only
33
+ happens when a real client is built, with a clear error pointing at the
34
+ extra to install.
35
+ - Offline unit tests drive every provider through injected fake clients;
36
+ live API tests are gated behind `LLM_JUDGE_KIT_LIVE_TESTS=1` (skipped by default).
37
+
38
+ - **Consensus + reliability (M3):**
39
+ - `ConsensusJudge` and `Judge.consensus([...], rubric=...)` — aggregate
40
+ several judges (mean/median); `confidence` is derived from agreement
41
+ (tight spread → high confidence) and member votes land in `metadata`.
42
+ - `RetryProvider` — composable retry-with-backoff and optional per-call
43
+ timeout wrapper (injectable sleep for deterministic tests).
44
+ - `CachingProvider` — memoizes completions; key = hash of
45
+ `version + provider + model + prompt + kwargs` (pluggable store).
46
+ - Structured logging on the `llm_judge_kit` logger (ships a `NullHandler`;
47
+ `enable_debug_logging()` for local debugging).
48
+ - Version moved to `_version.py` (single source; hatchling dynamic version).
49
+
50
+ - **Adoption surface (M4):**
51
+ - Bundled pytest plugin (`pytest11` entry point → top-level `pytest_llm_judge_kit`
52
+ module): the `llm_judge_kit` fixture and `JudgeHelper.assert_passes` let you
53
+ write evals as ordinary pytest tests, with score/reason/violations in the
54
+ failure message. Choose the model with `--llm-judge-kit-provider` or
55
+ `$LLM_JUDGE_KIT_PROVIDER` (defaults to `mock`, so suites run offline).
56
+ - README "Integrations" section, including a framework-agnostic note (works
57
+ with LangChain / LlamaIndex / any pipeline — it judges strings).
58
+
59
+ - **Platform layer (M5):**
60
+ - `llm_judge_kit` CLI (argparse, no new deps): `eval` (run a dataset → report,
61
+ with `--fail-under` for CI gating), `compare` (several providers side by
62
+ side), `report` (re-render a saved JSON report).
63
+ - Dataset loader (`load_dataset`, `Case`) for `.jsonl`/`.ndjson`/`.json`.
64
+ - Benchmark engine (`run_benchmark`, `Report` with count/passed/pass_rate/
65
+ mean_score), kept separate from reporting.
66
+ - Reporting (`render_json`/`render_markdown`/`render_html`, `load_report`) —
67
+ JSON round-trips back into a `Report`.
68
+
69
+ ### Hardened (M1 adversarial review)
70
+ - Trailing-comma JSON repair is now string-aware — it no longer corrupts string
71
+ values that contain `,}` or `,]`.
72
+ - Non-finite scores/confidence (`NaN`, `Infinity`) are rejected/defaulted
73
+ instead of silently clamping to a perfect `1.0`.
74
+ - `JudgeResult` and `ProviderResponse` are now genuinely hashable (the `dict`
75
+ `metadata` field is excluded from the hash).
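The string-aware trailing-comma repair described under "Hardened" can be illustrated with a small standalone sketch. This is not the package's `extract_json` implementation; `strip_trailing_commas` is a hypothetical helper that shows why the scanner has to track whether it is inside a string literal before dropping a comma.

```python
import json


def strip_trailing_commas(text: str) -> str:
    """Drop commas that directly precede } or ], but only outside string literals.

    Hypothetical helper for illustration; not part of llm-judge-kit.
    """
    out: list[str] = []
    in_string = False
    escaped = False
    for i, ch in enumerate(text):
        if in_string:
            out.append(ch)
            if escaped:
                escaped = False          # this character was escaped; consume it
            elif ch == "\\":
                escaped = True           # next character is escaped
            elif ch == '"':
                in_string = False        # closing quote
            continue
        if ch == '"':
            in_string = True
            out.append(ch)
            continue
        if ch == ",":
            # Look past whitespace: a closing delimiter means the comma is trailing.
            rest = text[i + 1:].lstrip()
            if rest[:1] in ("}", "]"):
                continue
        out.append(ch)
    return "".join(out)


raw = '{"a": [1, 2,], "b": "x,}",}'
print(json.loads(strip_trailing_commas(raw)))
```

A naive regular expression such as `,\s*([}\]])` would also rewrite the `,}` inside the string value `"x,}"`, which is the corruption the changelog entry refers to.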
@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2026 LLMJudge contributors
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
@@ -0,0 +1,266 @@
1
+ Metadata-Version: 2.4
2
+ Name: llm-judge-kit
3
+ Version: 0.1.0
4
+ Summary: Provider-agnostic, reproducible, typed LLM-as-a-judge — a small primitive you can depend on.
5
+ Project-URL: Homepage, https://github.com/Nikolay26ru/llm-judge-kit
6
+ Project-URL: Repository, https://github.com/Nikolay26ru/llm-judge-kit
7
+ Project-URL: Changelog, https://github.com/Nikolay26ru/llm-judge-kit/blob/main/CHANGELOG.md
8
+ Author: LLMJudge contributors
9
+ License-Expression: MIT
10
+ License-File: LICENSE
11
+ Keywords: eval,evaluation,llm,llm-as-a-judge,rubric,testing
12
+ Classifier: Development Status :: 4 - Beta
13
+ Classifier: Intended Audience :: Developers
14
+ Classifier: License :: OSI Approved :: MIT License
15
+ Classifier: Programming Language :: Python :: 3 :: Only
16
+ Classifier: Programming Language :: Python :: 3.11
17
+ Classifier: Programming Language :: Python :: 3.12
18
+ Classifier: Programming Language :: Python :: 3.13
19
+ Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
20
+ Classifier: Topic :: Software Development :: Testing
21
+ Classifier: Typing :: Typed
22
+ Requires-Python: >=3.11
23
+ Provides-Extra: all
24
+ Requires-Dist: anthropic>=0.40; extra == 'all'
25
+ Requires-Dist: httpx>=0.27; extra == 'all'
26
+ Requires-Dist: openai>=1.0; extra == 'all'
27
+ Provides-Extra: anthropic
28
+ Requires-Dist: anthropic>=0.40; extra == 'anthropic'
29
+ Provides-Extra: ollama
30
+ Requires-Dist: httpx>=0.27; extra == 'ollama'
31
+ Provides-Extra: openai
32
+ Requires-Dist: openai>=1.0; extra == 'openai'
33
+ Description-Content-Type: text/markdown
34
+
35
+ # LLMJudge
36
+
37
+ > Provider-agnostic, reproducible, typed **LLM-as-a-judge** — a small primitive you can depend on.
38
+
39
+ [![CI](https://github.com/Nikolay26ru/llm-judge-kit/actions/workflows/ci.yml/badge.svg)](https://github.com/Nikolay26ru/llm-judge-kit/actions/workflows/ci.yml)
40
+ [![Python](https://img.shields.io/badge/python-3.11%2B-blue)](https://www.python.org/)
41
+ [![License: MIT](https://img.shields.io/badge/license-MIT-green)](LICENSE)
42
+ [![Typed](https://img.shields.io/badge/typed-mypy%20strict-blue)](https://mypy-lang.org/)
43
+
44
+ LLMJudge is one tiny, well-tested module for scoring model outputs with an LLM
45
+ judge — the part most projects re-implement badly. The core has **zero required
46
+ runtime dependencies**, a **stable typed API**, and runs **fully offline in
47
+ tests** via a deterministic mock.
48
+
49
+ ## Install
50
+
51
+ ```bash
52
+ pip install llm-judge-kit # core only, zero deps
53
+ pip install "llm-judge-kit[openai]" # + OpenAI-compatible provider
54
+ pip install "llm-judge-kit[anthropic]" # + Anthropic provider
55
+ pip install "llm-judge-kit[all]" # all providers
56
+ ```
57
+
58
+ ## Quickstart
59
+
60
+ This runs as-is — no API key, deterministic ([`examples/quickstart.py`](examples/quickstart.py)):
61
+
62
+ ```python
63
+ from llm_judge_kit import Judge, MockProvider
64
+
65
+ # MockProvider(fixed_score=...) keeps this example deterministic and offline.
66
+ judge = Judge(provider=MockProvider(fixed_score=0.9), rubric="factuality")
67
+
68
+ result = judge.score(
69
+ prompt="What is the capital of France?",
70
+ response="The capital of France is Paris.",
71
+ )
72
+
73
+ assert result > 0.8 # a JudgeResult compares like its float score
74
+ print(f"score={result.score} confidence={result.confidence}")
75
+ print(f"passed={result.passed()} reason={result.reason!r}")
76
+ ```
77
+
78
+ With a **real model**, swap the provider for a spec string — nothing else changes:
79
+
80
+ ```python
81
+ judge = Judge(provider="openai:gpt-5", rubric="factuality")
82
+ result = judge.score(prompt, response)
83
+ if not result.passed(0.7):
84
+ print("Failed:", result.reason, result.violations)
85
+ ```
86
+
87
+ ## Core concepts
88
+
89
+ | Piece | What it is |
90
+ | --- | --- |
91
+ | `Judge(provider, rubric)` | Pairs a model backend with a rubric; `score()` → `JudgeResult`. |
92
+ | `JudgeResult` | Frozen, typed verdict: `score`, `confidence`, `reason`, `evidence`, `violations`, `raw`, `metadata`. Compares and casts like its `score`. |
93
+ | `Provider` | A `Protocol` with one method, `complete(prompt) -> ProviderResponse`. |
94
+ | `Rubric` | Declarative description of *what* to evaluate; renders a strict-JSON judging prompt. |
95
+
96
+ ### `JudgeResult` ergonomics
97
+
98
+ ```python
99
+ r = judge.score(prompt, response)
100
+ r.score # float in [0, 1]
101
+ float(r) # same number
102
+ r > 0.8 # compares like its score
103
+ r.passed(0.7) # bool against a threshold
104
+ r.reason # short justification
105
+ r.evidence # tuple of supporting quotes/facts
106
+ r.violations # tuple of failed criteria
107
+ r.metadata # provider, model, token usage, latency, cost
108
+ ```
109
+
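One way to get a frozen result type that casts and compares like its score, while staying hashable despite carrying a `dict`, is sketched below. `Verdict` is a hypothetical stand-in and not the library's `JudgeResult`; the field set is trimmed, and using `compare=False` to keep `metadata` out of the hash is an assumption about how such a class could be built.

```python
from dataclasses import dataclass, field
from typing import Any, SupportsFloat


@dataclass(frozen=True)
class Verdict:
    """Hypothetical stand-in for a judge result; not the library's class."""

    score: float
    reason: str = ""
    # compare=False keeps the unhashable dict out of both __eq__ and __hash__.
    metadata: dict[str, Any] = field(default_factory=dict, compare=False)

    def __float__(self) -> float:
        return self.score

    # Ordering delegates to the score, so `verdict > 0.8` works against plain floats.
    def __lt__(self, other: SupportsFloat) -> bool:
        return self.score < float(other)

    def __le__(self, other: SupportsFloat) -> bool:
        return self.score <= float(other)

    def __gt__(self, other: SupportsFloat) -> bool:
        return self.score > float(other)

    def __ge__(self, other: SupportsFloat) -> bool:
        return self.score >= float(other)


v = Verdict(score=0.9, reason="matches the reference", metadata={"model": "mock"})
print(float(v), v > 0.8, isinstance(hash(v), int))
```

Because the dataclass is frozen and `metadata` is excluded from comparison, the generated `__hash__` only touches `score` and `reason`, so instances can be used as dictionary keys or set members.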
110
+ ### Built-in rubrics
111
+
112
+ `factuality`, `groundedness` (requires `context=`), `relevance`,
113
+ `instruction_following`, `safety`. List them with
114
+ `llm_judge_kit.available_rubrics()`.
115
+
116
+ ```python
117
+ judge = Judge(provider="openai:gpt-5", rubric="groundedness")
118
+ result = judge.score(question, answer, context=retrieved_docs) # RAG check
119
+ ```
120
+
121
+ ## Consensus (vote across models)
122
+
123
+ Run several judge models and aggregate — `confidence` reflects how much they
124
+ **agree** ([`examples/consensus.py`](examples/consensus.py)):
125
+
126
+ ```python
127
+ judge = Judge.consensus(
128
+ ["openai:gpt-5", "anthropic:claude-opus-4-8", "ollama:llama3"],
129
+ rubric="factuality",
130
+ )
131
+ result = judge.score(prompt, response)
132
+ result.score # mean (or median) of member scores
133
+ result.confidence # high when members agree, low when they diverge
134
+ result.metadata["votes"] # each member's score
135
+ ```
136
+
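The idea that confidence follows agreement can be sketched with the standard library alone. The formula below is an assumption chosen for illustration (one minus the population standard deviation, scaled by its maximum of 0.5 for values in [0, 1]); the package may derive confidence differently, and `aggregate` is a hypothetical name.

```python
from statistics import mean, median, pstdev


def aggregate(scores: list[float], method: str = "mean") -> tuple[float, float]:
    """Return (combined_score, confidence) for member scores in [0, 1].

    Hypothetical helper; the confidence formula is an illustrative assumption.
    """
    if not scores:
        raise ValueError("need at least one score")
    combined = median(scores) if method == "median" else mean(scores)
    # For values in [0, 1] the population standard deviation is at most 0.5,
    # so dividing by 0.5 maps the spread onto [0, 1].
    spread = pstdev(scores)
    confidence = max(0.0, 1.0 - spread / 0.5)
    return combined, confidence


print(aggregate([0.8, 0.8, 0.8]))      # identical votes: tight spread
print(aggregate([0.0, 1.0]))           # maximal disagreement
print(aggregate([0.2, 0.9, 0.8], "median"))
```

Median aggregation is less sensitive to a single outlying judge than the mean, which is one reason to offer both.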
137
+ ## Reliability & caching
138
+
139
+ Both wrappers are providers, so they compose around any backend
140
+ ([`examples/reliability_and_cache.py`](examples/reliability_and_cache.py)):
141
+
142
+ ```python
143
+ from llm_judge_kit import Judge, OpenAIProvider, RetryProvider, CachingProvider
144
+
145
+ provider = CachingProvider( # memoize identical calls
146
+ RetryProvider( # retry w/ backoff + timeout
147
+ OpenAIProvider(model="gpt-5"), retries=3, timeout=30,
148
+ )
149
+ )
150
+ judge = Judge(provider=provider, rubric="factuality")
151
+ ```
152
+
153
+ The cache key is `version + provider + model + prompt + kwargs`, so it is stable
154
+ and invalidates correctly across library versions. Logs are emitted on the
155
+ `llm_judge_kit` logger (silent by default; call `enable_debug_logging()` to see them).
156
+
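A cache key built from `version + provider + model + prompt + kwargs` could look roughly like the sketch below. The serialization (sorted-key JSON) and the SHA-256 digest are assumptions; the README only states which inputs feed the hash. `cache_key` is a hypothetical function name.

```python
import hashlib
import json


def cache_key(version: str, provider: str, model: str, prompt: str, **kwargs: object) -> str:
    """Hash every input that can change a completion. Hypothetical helper."""
    payload = json.dumps(
        {
            "version": version,
            "provider": provider,
            "model": model,
            "prompt": prompt,
            "kwargs": kwargs,
        },
        sort_keys=True,   # keyword-argument order must not change the key
        default=str,      # fall back to str() for values JSON cannot encode
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


a = cache_key("0.1.0", "openai", "gpt-5", "Is Paris in France?", temperature=0, seed=1)
b = cache_key("0.1.0", "openai", "gpt-5", "Is Paris in France?", seed=1, temperature=0)
c = cache_key("0.2.0", "openai", "gpt-5", "Is Paris in France?", temperature=0, seed=1)
print(a == b, a == c)
```

Including the library version in the payload means a release that changes prompt rendering or parsing produces different keys, so stale entries are simply never hit again.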
157
+ ## Integrations
158
+
159
+ ### pytest — eval as ordinary tests
160
+
161
+ Installing `llm-judge-kit` registers a pytest plugin (no conftest wiring). The
162
+ `llm_judge_kit` fixture turns an eval into a normal test; a failure reads like any
163
+ other failing assertion (score, reason, violations):
164
+
165
+ ```python
166
+ def test_answer_is_grounded(llm_judge_kit):
167
+ llm_judge_kit.assert_passes(
168
+ prompt="How tall is the Eiffel Tower?",
169
+ response=my_rag_pipeline("How tall is the Eiffel Tower?"),
170
+ rubric="groundedness",
171
+ context=retrieved_docs,
172
+ threshold=0.7,
173
+ )
174
+ ```
175
+
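To show what "a failure reads like any other failing assertion" might mean in practice, here is a minimal sketch of an assertion helper that puts score, reason, and violations into the message. It is not the plugin's `JudgeHelper.assert_passes`; the function name, signature, and message layout are assumptions.

```python
def assert_score_passes(
    score: float,
    reason: str,
    violations: tuple[str, ...],
    threshold: float = 0.7,
) -> None:
    """Raise AssertionError with a readable message when the score is too low.

    Hypothetical helper; the real plugin's message format may differ.
    """
    if score >= threshold:
        return
    details = "; ".join(violations) or "none reported"
    raise AssertionError(
        f"judge score {score:.2f} is below threshold {threshold:.2f}\n"
        f"reason: {reason}\n"
        f"violations: {details}"
    )


try:
    assert_score_passes(0.4, "claims a height not in the context", ("unsupported claim",))
except AssertionError as exc:
    print(exc)
```

Raising a plain `AssertionError` lets pytest report the failure through its normal channel, with no custom reporting hook needed.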
176
+ Pick the judge model once for the whole suite — it defaults to `mock` (offline),
177
+ so tests are green until you point them at a real model:
178
+
179
+ ```bash
180
+ pytest --llm-judge-kit-provider "openai:gpt-5" # or: export LLM_JUDGE_KIT_PROVIDER=...
181
+ ```
182
+
183
+ ### Any framework
184
+
185
+ LLMJudge judges *strings*, so it drops into any stack — LangChain, LlamaIndex,
186
+ DSPy, a raw script — with no adapter. Whatever produces the output, pass it in:
187
+
188
+ ```python
189
+ output = my_chain.invoke(question) # LangChain / LlamaIndex / your code
190
+ result = Judge(provider="openai:gpt-5", rubric="relevance").score(question, output)
191
+ ```
192
+
193
+ ## Extend without touching the core
194
+
195
+ Add a **rubric** ([`examples/custom_rubric.py`](examples/custom_rubric.py)):
196
+
197
+ ```python
198
+ from llm_judge_kit import Judge, Rubric, register_rubric
199
+
200
+ register_rubric(Rubric(
201
+ name="conciseness",
202
+ description="Whether the response is as short as possible while complete.",
203
+ criteria=("No filler or repetition.", "Every sentence earns its place."),
204
+ ))
205
+ judge = Judge(provider="openai:gpt-5", rubric="conciseness")
206
+ ```
207
+
208
+ Add a **provider** — implement one method, optionally register a scheme:
209
+
210
+ ```python
211
+ from llm_judge_kit import Judge, ProviderResponse, register_provider
212
+
213
+ class MyProvider:
214
+ name = "mine"
215
+ def complete(self, prompt: str, **kwargs: object) -> ProviderResponse:
216
+ return ProviderResponse(text=call_my_model(prompt))
217
+
218
+ register_provider("mine", lambda model: MyProvider())
219
+ judge = Judge(provider="mine:v1", rubric="relevance")
220
+ ```
221
+
222
+ ## CLI & batch evaluation
223
+
224
+ Score a whole dataset and get a report — JSON, Markdown, or HTML. A dataset is
225
+ JSON Lines (`prompt` + `response`, optional `context`/`reference`/`id`); see
226
+ [`examples/sample_dataset.jsonl`](examples/sample_dataset.jsonl).
227
+
228
+ ```bash
229
+ llm-judge-kit eval cases.jsonl --provider openai:gpt-5 --rubric factuality --format md
230
+ llm-judge-kit eval cases.jsonl --fail-under 0.9 # exit non-zero in CI if pass rate drops
231
+ llm-judge-kit compare cases.jsonl --provider openai:gpt-5 --provider anthropic:claude-opus-4-8
232
+ llm-judge-kit report report.json --format html -o report.html
233
+ ```
234
+
235
+ Same thing in code ([`examples/benchmark_report.py`](examples/benchmark_report.py)):
236
+
237
+ ```python
238
+ from llm_judge_kit import Judge, load_dataset, run_benchmark, render_markdown
239
+
240
+ cases = load_dataset("cases.jsonl")
241
+ judge = Judge(provider="openai:gpt-5", rubric="factuality")
242
+ report = run_benchmark(judge, cases, provider="openai:gpt-5", rubric="factuality")
243
+ print(report.pass_rate, report.mean_score)
244
+ print(render_markdown(report))
245
+ ```
246
+
247
+ ## Why depend on this
248
+
249
+ - **Easy to depend on** — zero transitive deps in the core; provider SDKs are opt-in extras.
250
+ - **Reproducible** — deterministic offline `MockProvider`; all unit tests run without network.
251
+ - **Typed** — `mypy --strict` clean; ships `py.typed`.
252
+ - **Robust parsing** — recovers JSON from markdown fences, prose, and trailing commas.
253
+ - **Extensible** — new provider / rubric / judge without core changes.
254
+
255
+ ## Development
256
+
257
+ ```bash
258
+ uv sync --all-extras
259
+ uv run ruff check . && uv run ruff format --check . && uv run mypy src && uv run pytest --cov=llm_judge_kit --cov-report=term-missing
260
+ ```
261
+
262
+ See [CONTRIBUTING.md](CONTRIBUTING.md). The plan of record is in [ROADMAP.md](ROADMAP.md).
263
+
264
+ ## License
265
+
266
+ MIT — see [LICENSE](LICENSE).
@@ -0,0 +1,232 @@
1
+ # LLMJudge
2
+
3
+ > Provider-agnostic, reproducible, typed **LLM-as-a-judge** — a small primitive you can depend on.
4
+
5
+ [![CI](https://github.com/Nikolay26ru/llm-judge-kit/actions/workflows/ci.yml/badge.svg)](https://github.com/Nikolay26ru/llm-judge-kit/actions/workflows/ci.yml)
6
+ [![Python](https://img.shields.io/badge/python-3.11%2B-blue)](https://www.python.org/)
7
+ [![License: MIT](https://img.shields.io/badge/license-MIT-green)](LICENSE)
8
+ [![Typed](https://img.shields.io/badge/typed-mypy%20strict-blue)](https://mypy-lang.org/)
9
+
10
+ LLMJudge is one tiny, well-tested module for scoring model outputs with an LLM
11
+ judge — the part most projects re-implement badly. The core has **zero required
12
+ runtime dependencies**, a **stable typed API**, and runs **fully offline in
13
+ tests** via a deterministic mock.
14
+
15
+ ## Install
16
+
17
+ ```bash
18
+ pip install llm-judge-kit # core only, zero deps
19
+ pip install "llm-judge-kit[openai]" # + OpenAI-compatible provider
20
+ pip install "llm-judge-kit[anthropic]" # + Anthropic provider
21
+ pip install "llm-judge-kit[all]" # all providers
22
+ ```
23
+
24
+ ## Quickstart
25
+
26
+ This runs as-is — no API key, deterministic ([`examples/quickstart.py`](examples/quickstart.py)):
27
+
28
+ ```python
29
+ from llm_judge_kit import Judge, MockProvider
30
+
31
+ # MockProvider(fixed_score=...) keeps this example deterministic and offline.
32
+ judge = Judge(provider=MockProvider(fixed_score=0.9), rubric="factuality")
33
+
34
+ result = judge.score(
35
+ prompt="What is the capital of France?",
36
+ response="The capital of France is Paris.",
37
+ )
38
+
39
+ assert result > 0.8 # a JudgeResult compares like its float score
40
+ print(f"score={result.score} confidence={result.confidence}")
41
+ print(f"passed={result.passed()} reason={result.reason!r}")
42
+ ```
43
+
44
+ With a **real model**, swap the provider for a spec string — nothing else changes:
45
+
46
+ ```python
47
+ judge = Judge(provider="openai:gpt-5", rubric="factuality")
48
+ result = judge.score(prompt, response)
49
+ if not result.passed(0.7):
50
+ print("Failed:", result.reason, result.violations)
51
+ ```
52
+
53
+ ## Core concepts
54
+
55
+ | Piece | What it is |
56
+ | --- | --- |
57
+ | `Judge(provider, rubric)` | Pairs a model backend with a rubric; `score()` → `JudgeResult`. |
58
+ | `JudgeResult` | Frozen, typed verdict: `score`, `confidence`, `reason`, `evidence`, `violations`, `raw`, `metadata`. Compares and casts like its `score`. |
59
+ | `Provider` | A `Protocol` with one method, `complete(prompt) -> ProviderResponse`. |
60
+ | `Rubric` | Declarative description of *what* to evaluate; renders a strict-JSON judging prompt. |
61
+
62
+ ### `JudgeResult` ergonomics
63
+
64
+ ```python
65
+ r = judge.score(prompt, response)
66
+ r.score # float in [0, 1]
67
+ float(r) # same number
68
+ r > 0.8 # compares like its score
69
+ r.passed(0.7) # bool against a threshold
70
+ r.reason # short justification
71
+ r.evidence # tuple of supporting quotes/facts
72
+ r.violations # tuple of failed criteria
73
+ r.metadata # provider, model, token usage, latency, cost
74
+ ```
75
+
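Keeping `score` inside [0, 1] is safer when non-finite values are rejected before clamping, since `min(1.0, max(0.0, float("inf")))` evaluates to a perfect `1.0`. The sketch below illustrates that ordering; `normalize_score` is a hypothetical helper, and raising `ValueError` (rather than substituting a default) is an assumption.

```python
import math


def normalize_score(value: object) -> float:
    """Validate a judge score and clamp it to [0, 1]. Hypothetical helper."""
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        raise ValueError(f"score must be a number, got {type(value).__name__}")
    score = float(value)
    if not math.isfinite(score):
        # Reject NaN and +/-Infinity instead of letting the clamp hide them.
        raise ValueError(f"score must be finite, got {score!r}")
    return min(1.0, max(0.0, score))


print(normalize_score(0.85), normalize_score(1.7), normalize_score(-0.2))
```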
76
+ ### Built-in rubrics
77
+
78
+ `factuality`, `groundedness` (requires `context=`), `relevance`,
79
+ `instruction_following`, `safety`. List them with
80
+ `llm_judge_kit.available_rubrics()`.
81
+
82
+ ```python
83
+ judge = Judge(provider="openai:gpt-5", rubric="groundedness")
84
+ result = judge.score(question, answer, context=retrieved_docs) # RAG check
85
+ ```
86
+
87
+ ## Consensus (vote across models)
88
+
89
+ Run several judge models and aggregate — `confidence` reflects how much they
90
+ **agree** ([`examples/consensus.py`](examples/consensus.py)):
91
+
92
+ ```python
93
+ judge = Judge.consensus(
94
+ ["openai:gpt-5", "anthropic:claude-opus-4-8", "ollama:llama3"],
95
+ rubric="factuality",
96
+ )
97
+ result = judge.score(prompt, response)
98
+ result.score # mean (or median) of member scores
99
+ result.confidence # high when members agree, low when they diverge
100
+ result.metadata["votes"] # each member's score
101
+ ```
102
+
103
+ ## Reliability & caching
104
+
105
+ Both wrappers are providers, so they compose around any backend
106
+ ([`examples/reliability_and_cache.py`](examples/reliability_and_cache.py)):
107
+
108
+ ```python
109
+ from llm_judge_kit import Judge, OpenAIProvider, RetryProvider, CachingProvider
110
+
111
+ provider = CachingProvider( # memoize identical calls
112
+ RetryProvider( # retry w/ backoff + timeout
113
+ OpenAIProvider(model="gpt-5"), retries=3, timeout=30,
114
+ )
115
+ )
116
+ judge = Judge(provider=provider, rubric="factuality")
117
+ ```
118
+
119
+ The cache key is `version + provider + model + prompt + kwargs`, so it is stable
120
+ and invalidates correctly across library versions. Logs are emitted on the
121
+ `llm_judge_kit` logger (silent by default; call `enable_debug_logging()` to see them).
122
+
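The retry wrapper shown earlier in this section can be approximated with a few lines of plain Python. This sketch covers only exponential backoff with an injectable `sleep` (useful for deterministic tests); it omits the per-call timeout, and `call_with_retry`, `base_delay`, and the doubling schedule are assumptions rather than the package's actual parameters.

```python
import time
from collections.abc import Callable
from typing import TypeVar

T = TypeVar("T")


def call_with_retry(
    fn: Callable[[], T],
    retries: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying up to `retries` times with exponential backoff.

    Hypothetical helper; not the library's RetryProvider.
    """
    attempt = 0
    while True:
        try:
            return fn()
        except Exception:
            if attempt >= retries:
                raise                      # out of retries: surface the last error
            sleep(base_delay * 2**attempt)  # 0.5s, 1s, 2s, ...
            attempt += 1


# Deterministic demo: record delays instead of actually sleeping.
state = {"calls": 0}


def flaky() -> str:
    state["calls"] += 1
    if state["calls"] < 3:
        raise RuntimeError("transient failure")
    return "ok"


delays: list[float] = []
print(call_with_retry(flaky, sleep=delays.append), delays)
```

Passing `delays.append` as the sleep function is what makes the backoff schedule testable without waiting.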
123
+ ## Integrations
124
+
125
+ ### pytest — eval as ordinary tests
126
+
127
+ Installing `llm-judge-kit` registers a pytest plugin (no conftest wiring). The
128
+ `llm_judge_kit` fixture turns an eval into a normal test; a failure reads like any
129
+ other failing assertion (score, reason, violations):
130
+
131
+ ```python
132
+ def test_answer_is_grounded(llm_judge_kit):
133
+ llm_judge_kit.assert_passes(
134
+ prompt="How tall is the Eiffel Tower?",
135
+ response=my_rag_pipeline("How tall is the Eiffel Tower?"),
136
+ rubric="groundedness",
137
+ context=retrieved_docs,
138
+ threshold=0.7,
139
+ )
140
+ ```
141
+
142
+ Pick the judge model once for the whole suite — it defaults to `mock` (offline),
143
+ so tests are green until you point them at a real model:
144
+
145
+ ```bash
146
+ pytest --llm-judge-kit-provider "openai:gpt-5" # or: export LLM_JUDGE_KIT_PROVIDER=...
147
+ ```
148
+
149
+ ### Any framework
150
+
151
+ LLMJudge judges *strings*, so it drops into any stack — LangChain, LlamaIndex,
152
+ DSPy, a raw script — with no adapter. Whatever produces the output, pass it in:
153
+
154
+ ```python
155
+ output = my_chain.invoke(question) # LangChain / LlamaIndex / your code
156
+ result = Judge(provider="openai:gpt-5", rubric="relevance").score(question, output)
157
+ ```
158
+
159
+ ## Extend without touching the core
160
+
161
+ Add a **rubric** ([`examples/custom_rubric.py`](examples/custom_rubric.py)):
162
+
163
+ ```python
164
+ from llm_judge_kit import Judge, Rubric, register_rubric
165
+
166
+ register_rubric(Rubric(
167
+ name="conciseness",
168
+ description="Whether the response is as short as possible while complete.",
169
+ criteria=("No filler or repetition.", "Every sentence earns its place."),
170
+ ))
171
+ judge = Judge(provider="openai:gpt-5", rubric="conciseness")
172
+ ```
173
+
174
+ Add a **provider** — implement one method, optionally register a scheme:
175
+
176
+ ```python
177
+ from llm_judge_kit import Judge, ProviderResponse, register_provider
178
+
179
+ class MyProvider:
180
+ name = "mine"
181
+ def complete(self, prompt: str, **kwargs: object) -> ProviderResponse:
182
+ return ProviderResponse(text=call_my_model(prompt))
183
+
184
+ register_provider("mine", lambda model: MyProvider())
185
+ judge = Judge(provider="mine:v1", rubric="relevance")
186
+ ```
187
+
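How a `"scheme:model"` spec might resolve to a provider factory is sketched below. This is an illustration under assumptions, not the package's registry: `register`, `parse_spec`, and `make` are hypothetical names, and treating a bare scheme such as `mock` as having an empty model string is a guess based on the `mock` default mentioned elsewhere.

```python
from collections.abc import Callable

# Maps a scheme to a factory that receives the model part of the spec.
_FACTORIES: dict[str, Callable[[str], object]] = {}


def register(scheme: str, factory: Callable[[str], object]) -> None:
    """Associate a scheme with a factory. Hypothetical helper."""
    _FACTORIES[scheme] = factory


def parse_spec(spec: str) -> tuple[str, str]:
    """Split 'scheme:model' at the first colon; the model part may be empty."""
    scheme, _, model = spec.partition(":")
    if not scheme:
        raise ValueError(f"invalid provider spec: {spec!r}")
    return scheme, model


def make(spec: str) -> object:
    """Build a provider from a spec string using the registered factory."""
    scheme, model = parse_spec(spec)
    try:
        factory = _FACTORIES[scheme]
    except KeyError:
        raise ValueError(f"unknown provider scheme: {scheme!r}") from None
    return factory(model)


register("echo", lambda model: f"echo<{model}>")
print(parse_spec("openai:gpt-5"), parse_spec("ollama:llama3:8b"), make("echo:v1"))
```

Splitting at the first colon only keeps model names that themselves contain a colon (such as tagged local models) intact.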
188
+ ## CLI & batch evaluation
189
+
190
+ Score a whole dataset and get a report — JSON, Markdown, or HTML. A dataset is
191
+ JSON Lines (`prompt` + `response`, optional `context`/`reference`/`id`); see
192
+ [`examples/sample_dataset.jsonl`](examples/sample_dataset.jsonl).
193
+
194
+ ```bash
195
+ llm-judge-kit eval cases.jsonl --provider openai:gpt-5 --rubric factuality --format md
196
+ llm-judge-kit eval cases.jsonl --fail-under 0.9 # exit non-zero in CI if pass rate drops
197
+ llm-judge-kit compare cases.jsonl --provider openai:gpt-5 --provider anthropic:claude-opus-4-8
198
+ llm-judge-kit report report.json --format html -o report.html
199
+ ```
200
+
201
+ Same thing in code ([`examples/benchmark_report.py`](examples/benchmark_report.py)):
202
+
203
+ ```python
204
+ from llm_judge_kit import Judge, load_dataset, run_benchmark, render_markdown
205
+
206
+ cases = load_dataset("cases.jsonl")
207
+ judge = Judge(provider="openai:gpt-5", rubric="factuality")
208
+ report = run_benchmark(judge, cases, provider="openai:gpt-5", rubric="factuality")
209
+ print(report.pass_rate, report.mean_score)
210
+ print(render_markdown(report))
211
+ ```
212
+
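The report fields named in the changelog (count, passed, pass rate, mean score) and the `--fail-under` gate reduce to simple arithmetic, sketched here without the library. `Summary`, `read_jsonl`, `summarize`, and `exit_code` are hypothetical names, and the default threshold of 0.7 is an assumption.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Summary:
    """Hypothetical stand-in for the aggregate part of a report."""

    count: int
    passed: int
    pass_rate: float
    mean_score: float


def read_jsonl(text: str) -> list[dict[str, object]]:
    """Parse JSON Lines, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def summarize(scores: list[float], threshold: float = 0.7) -> Summary:
    """Aggregate per-case scores; an empty run yields zeros rather than an error."""
    count = len(scores)
    passed = sum(1 for s in scores if s >= threshold)
    pass_rate = passed / count if count else 0.0
    mean_score = sum(scores) / count if count else 0.0
    return Summary(count, passed, pass_rate, mean_score)


def exit_code(summary: Summary, fail_under: float) -> int:
    """Non-zero when the pass rate falls below the gate, as a CI step expects."""
    return 1 if summary.pass_rate < fail_under else 0


cases = read_jsonl('{"prompt": "p1", "response": "r1"}\n\n{"prompt": "p2", "response": "r2"}\n')
summary = summarize([0.9, 0.8, 0.4, 0.75])
print(len(cases), summary, exit_code(summary, 0.9))
```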
213
+ ## Why depend on this
214
+
215
+ - **Easy to depend on** — zero transitive deps in the core; provider SDKs are opt-in extras.
216
+ - **Reproducible** — deterministic offline `MockProvider`; all unit tests run without network.
217
+ - **Typed** — `mypy --strict` clean; ships `py.typed`.
218
+ - **Robust parsing** — recovers JSON from markdown fences, prose, and trailing commas.
219
+ - **Extensible** — new provider / rubric / judge without core changes.
220
+
221
+ ## Development
222
+
223
+ ```bash
224
+ uv sync --all-extras
225
+ uv run ruff check . && uv run ruff format --check . && uv run mypy src && uv run pytest --cov=llm_judge_kit --cov-report=term-missing
226
+ ```
227
+
228
+ See [CONTRIBUTING.md](CONTRIBUTING.md). The plan of record is in [ROADMAP.md](ROADMAP.md).
229
+
230
+ ## License
231
+
232
+ MIT — see [LICENSE](LICENSE).