llm-judge-kit 0.1.0__tar.gz
- llm_judge_kit-0.1.0/.gitignore +34 -0
- llm_judge_kit-0.1.0/CHANGELOG.md +75 -0
- llm_judge_kit-0.1.0/LICENSE +21 -0
- llm_judge_kit-0.1.0/PKG-INFO +266 -0
- llm_judge_kit-0.1.0/README.md +232 -0
- llm_judge_kit-0.1.0/pyproject.toml +141 -0
- llm_judge_kit-0.1.0/src/llm_judge_kit/__init__.py +98 -0
- llm_judge_kit-0.1.0/src/llm_judge_kit/_logging.py +51 -0
- llm_judge_kit-0.1.0/src/llm_judge_kit/_version.py +9 -0
- llm_judge_kit-0.1.0/src/llm_judge_kit/benchmark.py +91 -0
- llm_judge_kit-0.1.0/src/llm_judge_kit/caching.py +63 -0
- llm_judge_kit-0.1.0/src/llm_judge_kit/cli.py +153 -0
- llm_judge_kit-0.1.0/src/llm_judge_kit/consensus.py +123 -0
- llm_judge_kit-0.1.0/src/llm_judge_kit/dataset.py +110 -0
- llm_judge_kit-0.1.0/src/llm_judge_kit/errors.py +43 -0
- llm_judge_kit-0.1.0/src/llm_judge_kit/judge.py +192 -0
- llm_judge_kit-0.1.0/src/llm_judge_kit/parsing.py +127 -0
- llm_judge_kit-0.1.0/src/llm_judge_kit/providers/__init__.py +40 -0
- llm_judge_kit-0.1.0/src/llm_judge_kit/providers/anthropic.py +84 -0
- llm_judge_kit-0.1.0/src/llm_judge_kit/providers/base.py +33 -0
- llm_judge_kit-0.1.0/src/llm_judge_kit/providers/mock.py +81 -0
- llm_judge_kit-0.1.0/src/llm_judge_kit/providers/ollama.py +81 -0
- llm_judge_kit-0.1.0/src/llm_judge_kit/providers/openai.py +81 -0
- llm_judge_kit-0.1.0/src/llm_judge_kit/providers/registry.py +67 -0
- llm_judge_kit-0.1.0/src/llm_judge_kit/py.typed +0 -0
- llm_judge_kit-0.1.0/src/llm_judge_kit/reliability.py +100 -0
- llm_judge_kit-0.1.0/src/llm_judge_kit/reporting.py +197 -0
- llm_judge_kit-0.1.0/src/llm_judge_kit/rubrics/__init__.py +33 -0
- llm_judge_kit-0.1.0/src/llm_judge_kit/rubrics/base.py +118 -0
- llm_judge_kit-0.1.0/src/llm_judge_kit/rubrics/builtin.py +77 -0
- llm_judge_kit-0.1.0/src/llm_judge_kit/types.py +91 -0
- llm_judge_kit-0.1.0/src/pytest_llm_judge_kit.py +118 -0
@@ -0,0 +1,34 @@
# Python
__pycache__/
*.py[cod]
*.egg-info/
.eggs/
build/
dist/
*.whl

# Environments / secrets
.venv/
venv/
.env
.env.*
!.env.example

# Tooling caches
.pytest_cache/
.mypy_cache/
.ruff_cache/
.coverage
.coverage.*
htmlcov/
coverage.xml

# OS / editors
.DS_Store
.idea/
.vscode/

# Local-only working notes (not part of the public project)
CLAUDE.md
PROMPT.md
docs/strategy.md
@@ -0,0 +1,75 @@
# Changelog

All notable changes to this project are documented here.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [Unreleased]

### Added
- Project scaffold: `pyproject.toml`, tooling gate (ruff + mypy strict + pytest
  with ≥95% coverage), CI workflow, MIT license, contributor docs.
- **Wedge core (M1):**
  - `Judge` — pairs a provider with a rubric; `score()` returns a typed
    `JudgeResult`; `__call__` alias and `passed()` convenience.
  - `JudgeResult` / `ProviderResponse` — frozen, typed; `JudgeResult` casts
    (`__float__`) and compares (`<`, `<=`, `>`, `>=`) like its score.
  - `Provider` protocol + `BaseProvider`; deterministic offline `MockProvider`.
  - Provider registry with `"scheme:model"` spec parsing (`register_provider`,
    `make_provider`, `available_providers`).
  - `Rubric` + registry and five built-ins: factuality, groundedness,
    relevance, instruction_following, safety (`register_rubric`, `get_rubric`,
    `available_rubrics`).
  - Robust `extract_json` parser (markdown fences, surrounding prose, trailing
    commas, nested braces) with a clear `ParseError`.
  - Typed exception hierarchy (`LLMJudgeError` and subclasses).
  - Runnable examples and a README quickstart that executes offline.

- **Real providers (M2):**
  - `OpenAIProvider` (Chat Completions; works with any OpenAI-compatible
    `base_url`), `AnthropicProvider` (Messages API), `OllamaProvider` (local,
    via httpx). Registered under the `openai:`/`anthropic:`/`ollama:` schemes.
  - SDKs are imported lazily — the core stays dependency-free; the import only
    happens when a real client is built, with a clear error pointing at the
    extra to install.
  - Offline unit tests drive every provider through injected fake clients;
    live API tests are gated behind `LLM_JUDGE_KIT_LIVE_TESTS=1` (skipped by default).

- **Consensus + reliability (M3):**
  - `ConsensusJudge` and `Judge.consensus([...], rubric=...)` — aggregate
    several judges (mean/median); `confidence` is derived from agreement
    (tight spread → high confidence) and member votes land in `metadata`.
  - `RetryProvider` — composable retry-with-backoff and optional per-call
    timeout wrapper (injectable sleep for deterministic tests).
  - `CachingProvider` — memoizes completions; key = hash of
    `version + provider + model + prompt + kwargs` (pluggable store).
  - Structured logging on the `llm_judge_kit` logger (ships a `NullHandler`;
    `enable_debug_logging()` for local debugging).
  - Version moved to `_version.py` (single source; hatchling dynamic version).

- **Adoption surface (M4):**
  - Bundled pytest plugin (`pytest11` entry point → top-level `pytest_llm_judge_kit`
    module): the `llm_judge_kit` fixture and `JudgeHelper.assert_passes` let you
    write evals as ordinary pytest tests, with score/reason/violations in the
    failure message. Choose the model with `--llm-judge-kit-provider` or
    `$LLM_JUDGE_KIT_PROVIDER` (defaults to `mock`, so suites run offline).
  - README "Integrations" section, including a framework-agnostic note (works
    with LangChain / LlamaIndex / any pipeline — it judges strings).

- **Platform layer (M5):**
  - `llm_judge_kit` CLI (argparse, no new deps): `eval` (run a dataset → report,
    with `--fail-under` for CI gating), `compare` (several providers side by
    side), `report` (re-render a saved JSON report).
  - Dataset loader (`load_dataset`, `Case`) for `.jsonl`/`.ndjson`/`.json`.
  - Benchmark engine (`run_benchmark`, `Report` with count/passed/pass_rate/
    mean_score), kept separate from reporting.
  - Reporting (`render_json`/`render_markdown`/`render_html`, `load_report`) —
    JSON round-trips back into a `Report`.

### Hardened (M1 adversarial review)
- Trailing-comma JSON repair is now string-aware — it no longer corrupts string
  values that contain `,}` or `,]`.
- Non-finite scores/confidence (`NaN`, `Infinity`) are rejected/defaulted
  instead of silently clamping to a perfect `1.0`.
- `JudgeResult` and `ProviderResponse` are now genuinely hashable (the `dict`
  `metadata` field is excluded from the hash).
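
The string-aware trailing-comma repair mentioned in the hardening notes can be sketched as a single pass that tracks whether the scanner is currently inside a JSON string. This is a minimal illustration, not the package's `extract_json`; the helper name `strip_trailing_commas` is hypothetical.

```python
import json


def strip_trailing_commas(text: str) -> str:
    """Drop commas that directly precede '}' or ']' outside of JSON strings.

    Hypothetical helper for illustration; not the library's implementation.
    """
    out: list[str] = []
    in_string = False
    escaped = False
    for i, ch in enumerate(text):
        if in_string:
            # Inside a string every character is copied verbatim.
            out.append(ch)
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            continue
        if ch == '"':
            in_string = True
            out.append(ch)
            continue
        if ch == ",":
            # Look ahead past whitespace; skip the comma if a closer follows.
            rest = text[i + 1 :].lstrip()
            if rest[:1] in ("}", "]"):
                continue
        out.append(ch)
    return "".join(out)


raw = '{"reason": "ends with ,} on purpose", "score": 0.9,}'
repaired = json.loads(strip_trailing_commas(raw))
```

A plain regex such as `,\s*[}\]]` would also rewrite the `,}` inside the `reason` string, which is the corruption the changelog entry describes.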
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2026 LLMJudge contributors

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
@@ -0,0 +1,266 @@
Metadata-Version: 2.4
Name: llm-judge-kit
Version: 0.1.0
Summary: Provider-agnostic, reproducible, typed LLM-as-a-judge — a small primitive you can depend on.
Project-URL: Homepage, https://github.com/Nikolay26ru/llm-judge-kit
Project-URL: Repository, https://github.com/Nikolay26ru/llm-judge-kit
Project-URL: Changelog, https://github.com/Nikolay26ru/llm-judge-kit/blob/main/CHANGELOG.md
Author: LLMJudge contributors
License-Expression: MIT
License-File: LICENSE
Keywords: eval,evaluation,llm,llm-as-a-judge,rubric,testing
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Testing
Classifier: Typing :: Typed
Requires-Python: >=3.11
Provides-Extra: all
Requires-Dist: anthropic>=0.40; extra == 'all'
Requires-Dist: httpx>=0.27; extra == 'all'
Requires-Dist: openai>=1.0; extra == 'all'
Provides-Extra: anthropic
Requires-Dist: anthropic>=0.40; extra == 'anthropic'
Provides-Extra: ollama
Requires-Dist: httpx>=0.27; extra == 'ollama'
Provides-Extra: openai
Requires-Dist: openai>=1.0; extra == 'openai'
Description-Content-Type: text/markdown

# LLMJudge

> Provider-agnostic, reproducible, typed **LLM-as-a-judge** — a small primitive you can depend on.

[](https://github.com/Nikolay26ru/llm-judge-kit/actions/workflows/ci.yml)
[](https://www.python.org/)
[](LICENSE)
[](https://mypy-lang.org/)

LLMJudge is one tiny, well-tested module for scoring model outputs with an LLM
judge — the part most projects re-implement badly. The core has **zero required
runtime dependencies**, a **stable typed API**, and runs **fully offline in
tests** via a deterministic mock.

## Install

```bash
pip install llm-judge-kit  # core only, zero deps
pip install "llm-judge-kit[openai]"  # + OpenAI-compatible provider
pip install "llm-judge-kit[anthropic]"  # + Anthropic provider
pip install "llm-judge-kit[all]"  # all providers
```

## Quickstart

This runs as-is — no API key, deterministic ([`examples/quickstart.py`](examples/quickstart.py)):

```python
from llm_judge_kit import Judge, MockProvider

# MockProvider(fixed_score=...) keeps this example deterministic and offline.
judge = Judge(provider=MockProvider(fixed_score=0.9), rubric="factuality")

result = judge.score(
    prompt="What is the capital of France?",
    response="The capital of France is Paris.",
)

assert result > 0.8  # a JudgeResult compares like its float score
print(f"score={result.score} confidence={result.confidence}")
print(f"passed={result.passed()} reason={result.reason!r}")
```

With a **real model**, swap the provider for a spec string — nothing else changes:

```python
judge = Judge(provider="openai:gpt-5", rubric="factuality")
result = judge.score(prompt, response)
if not result.passed(0.7):
    print("Failed:", result.reason, result.violations)
```

## Core concepts

| Piece | What it is |
| --- | --- |
| `Judge(provider, rubric)` | Pairs a model backend with a rubric; `score()` → `JudgeResult`. |
| `JudgeResult` | Frozen, typed verdict: `score`, `confidence`, `reason`, `evidence`, `violations`, `raw`, `metadata`. Compares and casts like its `score`. |
| `Provider` | A `Protocol` with one method, `complete(prompt) -> ProviderResponse`. |
| `Rubric` | Declarative description of *what* to evaluate; renders a strict-JSON judging prompt. |

### `JudgeResult` ergonomics

```python
r = judge.score(prompt, response)
r.score        # float in [0, 1]
float(r)       # same number
r > 0.8        # compares like its score
r.passed(0.7)  # bool against a threshold
r.reason       # short justification
r.evidence     # tuple of supporting quotes/facts
r.violations   # tuple of failed criteria
r.metadata     # provider, model, token usage, latency, cost
```
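
One way a frozen result type can compare and cast like its score is sketched below with a hypothetical `ScoredResult` dataclass, which is not the library's actual `JudgeResult`. The `metadata` dict is kept out of equality and hashing, in line with the changelog's note on hashability; the field layout and the default threshold of 0.5 are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass(frozen=True)
class ScoredResult:
    """Hypothetical stand-in for a typed verdict; not the library's class."""

    score: float
    reason: str = ""
    # A dict is unhashable, so keep it out of __eq__ and __hash__.
    metadata: dict[str, Any] = field(default_factory=dict, compare=False)

    def __float__(self) -> float:
        return self.score

    def __lt__(self, other: float) -> bool:
        return self.score < float(other)

    def __le__(self, other: float) -> bool:
        return self.score <= float(other)

    def __gt__(self, other: float) -> bool:
        return self.score > float(other)

    def __ge__(self, other: float) -> bool:
        return self.score >= float(other)

    def passed(self, threshold: float = 0.5) -> bool:
        # The 0.5 default is an assumption for this sketch.
        return self.score >= threshold


r = ScoredResult(score=0.9, reason="matches the reference", metadata={"model": "mock"})
```

With this shape, `r > 0.8`, `float(r)` and `hash(r)` all work, and two results that differ only in `metadata` compare equal.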

### Built-in rubrics

`factuality`, `groundedness` (requires `context=`), `relevance`,
`instruction_following`, `safety`. List them with
`llm_judge_kit.available_rubrics()`.

```python
judge = Judge(provider="openai:gpt-5", rubric="groundedness")
result = judge.score(question, answer, context=retrieved_docs)  # RAG check
```

## Consensus (vote across models)

Run several judge models and aggregate — `confidence` reflects how much they
**agree** ([`examples/consensus.py`](examples/consensus.py)):

```python
judge = Judge.consensus(
    ["openai:gpt-5", "anthropic:claude-opus-4-8", "ollama:llama3"],
    rubric="factuality",
)
result = judge.score(prompt, response)
result.score              # mean (or median) of member scores
result.confidence         # high when members agree, low when they diverge
result.metadata["votes"]  # each member's score
```
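
How agreement translates into `confidence` is not spelled out here, so the following is only a sketch under a stated assumption: confidence is taken as one minus twice the population standard deviation of the member scores, clamped to [0, 1]. The function name `aggregate_votes` and that formula are illustrative and may differ from what `ConsensusJudge` does.

```python
from statistics import mean, median, pstdev


def aggregate_votes(votes: list[float], method: str = "mean") -> tuple[float, float]:
    """Return (score, confidence) for member scores in [0, 1].

    Assumption for illustration: confidence = 1 - 2 * population std dev,
    which is 1.0 when all members agree and 0.0 at maximal disagreement.
    """
    if not votes:
        raise ValueError("at least one vote is required")
    if method not in ("mean", "median"):
        raise ValueError(f"unknown method: {method!r}")
    score = mean(votes) if method == "mean" else median(votes)
    confidence = max(0.0, min(1.0, 1.0 - 2.0 * pstdev(votes)))
    return score, confidence


agree_score, agree_conf = aggregate_votes([0.9, 0.9, 0.9])
split_score, split_conf = aggregate_votes([0.0, 1.0], method="median")
```

For scores bounded by 0 and 1 the population standard deviation cannot exceed 0.5, which is why doubling it maps the spread onto the full confidence range.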

## Reliability & caching

Both wrappers are providers, so they compose around any backend
([`examples/reliability_and_cache.py`](examples/reliability_and_cache.py)):

```python
from llm_judge_kit import Judge, OpenAIProvider, RetryProvider, CachingProvider

provider = CachingProvider(  # memoize identical calls
    RetryProvider(  # retry w/ backoff + timeout
        OpenAIProvider(model="gpt-5"), retries=3, timeout=30,
    )
)
judge = Judge(provider=provider, rubric="factuality")
```
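
The retry idea can be sketched generically. The function below is a hypothetical `call_with_retry`, not the package's `RetryProvider`; it shows exponential backoff with an injectable `sleep`, which is what lets such a wrapper be tested without real delays. The parameter names are assumptions, and the per-call timeout is left out for brevity.

```python
import time
from collections.abc import Callable
from typing import TypeVar

T = TypeVar("T")


def call_with_retry(
    fn: Callable[[], T],
    retries: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying up to `retries` times with exponential backoff."""
    attempt = 0
    while True:
        try:
            return fn()
        except Exception:
            if attempt >= retries:
                raise  # out of retries: surface the last error
            sleep(base_delay * (2**attempt))
            attempt += 1


# Deterministic demo: fail twice, then succeed; record delays instead of sleeping.
delays: list[float] = []
calls = {"n": 0}


def flaky() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"


outcome = call_with_retry(flaky, retries=3, sleep=delays.append)
```

Passing `delays.append` as the sleep function records the backoff schedule, so a test can assert on it directly.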

The cache key is `version + provider + model + prompt + kwargs`, so it is stable
and invalidates correctly across library versions. Logs are emitted on the
`llm_judge_kit` logger (silent by default; call `enable_debug_logging()` to see them).
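
A cache key built from those five components could be derived as below. This is a sketch, assuming SHA-256 over a canonical JSON encoding; the function name `cache_key` and the serialization are illustrative choices and not necessarily what `CachingProvider` does.

```python
import hashlib
import json
from typing import Any


def cache_key(version: str, provider: str, model: str, prompt: str, **kwargs: Any) -> str:
    """Hash the call's identifying fields into a stable hex digest."""
    payload = json.dumps(
        {
            "version": version,
            "provider": provider,
            "model": model,
            "prompt": prompt,
            "kwargs": kwargs,
        },
        sort_keys=True,  # kwarg order must not change the key
        default=repr,  # fall back to repr() for non-JSON values
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


k1 = cache_key("0.1.0", "openai", "gpt-5", "Is Paris in France?", temperature=0, top_p=1)
k2 = cache_key("0.1.0", "openai", "gpt-5", "Is Paris in France?", top_p=1, temperature=0)
k3 = cache_key("0.2.0", "openai", "gpt-5", "Is Paris in France?", temperature=0, top_p=1)
```

Here `k1` and `k2` match because keyword order is normalized, while `k3` differs because the version is part of the key, so entries written by an older release are not reused.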

## Integrations

### pytest — eval as ordinary tests

Installing `llm_judge_kit` registers a pytest plugin (no conftest wiring). The
`llm_judge_kit` fixture turns an eval into a normal test; a failure reads like any
other failing assertion (score, reason, violations):

```python
def test_answer_is_grounded(llm_judge_kit):
    llm_judge_kit.assert_passes(
        prompt="How tall is the Eiffel Tower?",
        response=my_rag_pipeline("How tall is the Eiffel Tower?"),
        rubric="groundedness",
        context=retrieved_docs,
        threshold=0.7,
    )
```

Pick the judge model once for the whole suite — it defaults to `mock` (offline),
so tests are green until you point them at a real model:

```bash
pytest --llm-judge-kit-provider "openai:gpt-5"  # or: export LLM_JUDGE_KIT_PROVIDER=...
```

### Any framework

LLMJudge judges *strings*, so it drops into any stack — LangChain, LlamaIndex,
DSPy, a raw script — with no adapter. Whatever produces the output, pass it in:

```python
output = my_chain.invoke(question)  # LangChain / LlamaIndex / your code
result = Judge(provider="openai:gpt-5", rubric="relevance").score(question, output)
```

## Extend without touching the core

Add a **rubric** ([`examples/custom_rubric.py`](examples/custom_rubric.py)):

```python
from llm_judge_kit import Rubric, register_rubric

register_rubric(Rubric(
    name="conciseness",
    description="Whether the response is as short as possible while complete.",
    criteria=("No filler or repetition.", "Every sentence earns its place."),
))
judge = Judge(provider="openai:gpt-5", rubric="conciseness")
```

Add a **provider** — implement one method, optionally register a scheme:

```python
from llm_judge_kit import ProviderResponse, register_provider

class MyProvider:
    name = "mine"
    def complete(self, prompt: str, **kwargs: object) -> ProviderResponse:
        return ProviderResponse(text=call_my_model(prompt))

register_provider("mine", lambda model: MyProvider())
judge = Judge(provider="mine:v1", rubric="relevance")
```

## CLI & batch evaluation

Score a whole dataset and get a report — JSON, Markdown, or HTML. A dataset is
JSON Lines (`prompt` + `response`, optional `context`/`reference`/`id`); see
[`examples/sample_dataset.jsonl`](examples/sample_dataset.jsonl).

```bash
llm-judge-kit eval cases.jsonl --provider openai:gpt-5 --rubric factuality --format md
llm-judge-kit eval cases.jsonl --fail-under 0.9  # exit non-zero in CI if pass rate drops
llm-judge-kit compare cases.jsonl --provider openai:gpt-5 --provider anthropic:claude-opus-4-8
llm-judge-kit report report.json --format html -o report.html
```
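
What `--fail-under` gating amounts to can be sketched in a few lines. The code below is illustrative only: it parses JSON Lines with the standard library and turns a pass rate into a process exit code. The names `parse_jsonl` and `exit_code`, the 0.5 pass threshold, and the stand-in scores are all assumptions rather than the CLI's internals.

```python
import json


def parse_jsonl(text: str) -> list[dict[str, object]]:
    """Parse JSON Lines, skipping blank lines; each case needs prompt and response."""
    cases = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue
        case = json.loads(line)
        missing = {"prompt", "response"} - set(case)
        if missing:
            raise ValueError(f"line {lineno}: missing {sorted(missing)}")
        cases.append(case)
    return cases


def exit_code(scores: list[float], fail_under: float, threshold: float = 0.5) -> int:
    """Return 1 when the share of scores >= threshold falls below fail_under."""
    pass_rate = sum(s >= threshold for s in scores) / len(scores) if scores else 0.0
    return 0 if pass_rate >= fail_under else 1


sample = """
{"id": "a", "prompt": "Capital of France?", "response": "Paris."}
{"id": "b", "prompt": "2 + 2?", "response": "5."}
"""
cases = parse_jsonl(sample)
# Stand-in scores; a real run would obtain one score per case from a judge.
code = exit_code([0.9, 0.1], fail_under=0.9)
```

In this example one of two cases passes, so the pass rate is 0.5 and the gate returns a non-zero code, which is what would fail a CI job.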

Same thing in code ([`examples/benchmark_report.py`](examples/benchmark_report.py)):

```python
from llm_judge_kit import Judge, load_dataset, run_benchmark, render_markdown

cases = load_dataset("cases.jsonl")
judge = Judge(provider="openai:gpt-5", rubric="factuality")
report = run_benchmark(judge, cases, provider="openai:gpt-5", rubric="factuality")
print(report.pass_rate, report.mean_score)
print(render_markdown(report))
```

## Why depend on this

- **Easy to depend on** — zero transitive deps in the core; provider SDKs are opt-in extras.
- **Reproducible** — deterministic offline `MockProvider`; all unit tests run without network.
- **Typed** — `mypy --strict` clean; ships `py.typed`.
- **Robust parsing** — recovers JSON from markdown fences, prose, and trailing commas.
- **Extensible** — new provider / rubric / judge without core changes.

## Development

```bash
uv sync --all-extras
uv run ruff check . && uv run ruff format --check . && uv run mypy src && uv run pytest --cov=llm_judge_kit --cov-report=term-missing
```

See [CONTRIBUTING.md](CONTRIBUTING.md). The plan of record is in [ROADMAP.md](ROADMAP.md).

## License

MIT — see [LICENSE](LICENSE).
@@ -0,0 +1,232 @@
# LLMJudge

> Provider-agnostic, reproducible, typed **LLM-as-a-judge** — a small primitive you can depend on.

[](https://github.com/Nikolay26ru/llm-judge-kit/actions/workflows/ci.yml)
[](https://www.python.org/)
[](LICENSE)
[](https://mypy-lang.org/)

LLMJudge is one tiny, well-tested module for scoring model outputs with an LLM
judge — the part most projects re-implement badly. The core has **zero required
runtime dependencies**, a **stable typed API**, and runs **fully offline in
tests** via a deterministic mock.

## Install

```bash
pip install llm-judge-kit  # core only, zero deps
pip install "llm-judge-kit[openai]"  # + OpenAI-compatible provider
pip install "llm-judge-kit[anthropic]"  # + Anthropic provider
pip install "llm-judge-kit[all]"  # all providers
```

## Quickstart

This runs as-is — no API key, deterministic ([`examples/quickstart.py`](examples/quickstart.py)):

```python
from llm_judge_kit import Judge, MockProvider

# MockProvider(fixed_score=...) keeps this example deterministic and offline.
judge = Judge(provider=MockProvider(fixed_score=0.9), rubric="factuality")

result = judge.score(
    prompt="What is the capital of France?",
    response="The capital of France is Paris.",
)

assert result > 0.8  # a JudgeResult compares like its float score
print(f"score={result.score} confidence={result.confidence}")
print(f"passed={result.passed()} reason={result.reason!r}")
```

With a **real model**, swap the provider for a spec string — nothing else changes:

```python
judge = Judge(provider="openai:gpt-5", rubric="factuality")
result = judge.score(prompt, response)
if not result.passed(0.7):
    print("Failed:", result.reason, result.violations)
```

## Core concepts

| Piece | What it is |
| --- | --- |
| `Judge(provider, rubric)` | Pairs a model backend with a rubric; `score()` → `JudgeResult`. |
| `JudgeResult` | Frozen, typed verdict: `score`, `confidence`, `reason`, `evidence`, `violations`, `raw`, `metadata`. Compares and casts like its `score`. |
| `Provider` | A `Protocol` with one method, `complete(prompt) -> ProviderResponse`. |
| `Rubric` | Declarative description of *what* to evaluate; renders a strict-JSON judging prompt. |

### `JudgeResult` ergonomics

```python
r = judge.score(prompt, response)
r.score        # float in [0, 1]
float(r)       # same number
r > 0.8        # compares like its score
r.passed(0.7)  # bool against a threshold
r.reason       # short justification
r.evidence     # tuple of supporting quotes/facts
r.violations   # tuple of failed criteria
r.metadata     # provider, model, token usage, latency, cost
```

### Built-in rubrics

`factuality`, `groundedness` (requires `context=`), `relevance`,
`instruction_following`, `safety`. List them with
`llm_judge_kit.available_rubrics()`.

```python
judge = Judge(provider="openai:gpt-5", rubric="groundedness")
result = judge.score(question, answer, context=retrieved_docs)  # RAG check
```

## Consensus (vote across models)

Run several judge models and aggregate — `confidence` reflects how much they
**agree** ([`examples/consensus.py`](examples/consensus.py)):

```python
judge = Judge.consensus(
    ["openai:gpt-5", "anthropic:claude-opus-4-8", "ollama:llama3"],
    rubric="factuality",
)
result = judge.score(prompt, response)
result.score              # mean (or median) of member scores
result.confidence         # high when members agree, low when they diverge
result.metadata["votes"]  # each member's score
```

## Reliability & caching

Both wrappers are providers, so they compose around any backend
([`examples/reliability_and_cache.py`](examples/reliability_and_cache.py)):

```python
from llm_judge_kit import Judge, OpenAIProvider, RetryProvider, CachingProvider

provider = CachingProvider(  # memoize identical calls
    RetryProvider(  # retry w/ backoff + timeout
        OpenAIProvider(model="gpt-5"), retries=3, timeout=30,
    )
)
judge = Judge(provider=provider, rubric="factuality")
```

The cache key is `version + provider + model + prompt + kwargs`, so it is stable
and invalidates correctly across library versions. Logs are emitted on the
`llm_judge_kit` logger (silent by default; call `enable_debug_logging()` to see them).

## Integrations

### pytest — eval as ordinary tests

Installing `llm_judge_kit` registers a pytest plugin (no conftest wiring). The
`llm_judge_kit` fixture turns an eval into a normal test; a failure reads like any
other failing assertion (score, reason, violations):

```python
def test_answer_is_grounded(llm_judge_kit):
    llm_judge_kit.assert_passes(
        prompt="How tall is the Eiffel Tower?",
        response=my_rag_pipeline("How tall is the Eiffel Tower?"),
        rubric="groundedness",
        context=retrieved_docs,
        threshold=0.7,
    )
```

Pick the judge model once for the whole suite — it defaults to `mock` (offline),
so tests are green until you point them at a real model:

```bash
pytest --llm-judge-kit-provider "openai:gpt-5"  # or: export LLM_JUDGE_KIT_PROVIDER=...
```

### Any framework

LLMJudge judges *strings*, so it drops into any stack — LangChain, LlamaIndex,
DSPy, a raw script — with no adapter. Whatever produces the output, pass it in:

```python
output = my_chain.invoke(question)  # LangChain / LlamaIndex / your code
result = Judge(provider="openai:gpt-5", rubric="relevance").score(question, output)
```

## Extend without touching the core

Add a **rubric** ([`examples/custom_rubric.py`](examples/custom_rubric.py)):

```python
from llm_judge_kit import Rubric, register_rubric

register_rubric(Rubric(
    name="conciseness",
    description="Whether the response is as short as possible while complete.",
    criteria=("No filler or repetition.", "Every sentence earns its place."),
))
judge = Judge(provider="openai:gpt-5", rubric="conciseness")
```

Add a **provider** — implement one method, optionally register a scheme:

```python
from llm_judge_kit import ProviderResponse, register_provider

class MyProvider:
    name = "mine"
    def complete(self, prompt: str, **kwargs: object) -> ProviderResponse:
        return ProviderResponse(text=call_my_model(prompt))

register_provider("mine", lambda model: MyProvider())
judge = Judge(provider="mine:v1", rubric="relevance")
```

## CLI & batch evaluation

Score a whole dataset and get a report — JSON, Markdown, or HTML. A dataset is
JSON Lines (`prompt` + `response`, optional `context`/`reference`/`id`); see
[`examples/sample_dataset.jsonl`](examples/sample_dataset.jsonl).

```bash
llm-judge-kit eval cases.jsonl --provider openai:gpt-5 --rubric factuality --format md
llm-judge-kit eval cases.jsonl --fail-under 0.9  # exit non-zero in CI if pass rate drops
llm-judge-kit compare cases.jsonl --provider openai:gpt-5 --provider anthropic:claude-opus-4-8
llm-judge-kit report report.json --format html -o report.html
```

Same thing in code ([`examples/benchmark_report.py`](examples/benchmark_report.py)):

```python
from llm_judge_kit import Judge, load_dataset, run_benchmark, render_markdown

cases = load_dataset("cases.jsonl")
judge = Judge(provider="openai:gpt-5", rubric="factuality")
report = run_benchmark(judge, cases, provider="openai:gpt-5", rubric="factuality")
print(report.pass_rate, report.mean_score)
print(render_markdown(report))
```

## Why depend on this

- **Easy to depend on** — zero transitive deps in the core; provider SDKs are opt-in extras.
- **Reproducible** — deterministic offline `MockProvider`; all unit tests run without network.
- **Typed** — `mypy --strict` clean; ships `py.typed`.
- **Robust parsing** — recovers JSON from markdown fences, prose, and trailing commas.
- **Extensible** — new provider / rubric / judge without core changes.

## Development

```bash
uv sync --all-extras
uv run ruff check . && uv run ruff format --check . && uv run mypy src && uv run pytest --cov=llm_judge_kit --cov-report=term-missing
```

See [CONTRIBUTING.md](CONTRIBUTING.md). The plan of record is in [ROADMAP.md](ROADMAP.md).

## License

MIT — see [LICENSE](LICENSE).