ruby_llm-contract 0.10.5 → 0.10.6
- checksums.yaml +4 -4
- data/CHANGELOG.md +9 -0
- data/docs/guide/llm_judge.md +2 -2
- data/docs/guide/relation_to_tribunal.md +28 -0
- data/lib/ruby_llm/contract/version.rb +1 -1
- metadata +1 -1
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 3a8b39b568c6799c294c1f303f7d727e7c2daaf71762c84af9e255f99ba65482
+  data.tar.gz: 890deebfa3a297be4bb65e219ea3db1ea3e8a03e72dd0e51ce74ae04de62385c
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 64ab1a5b8ac27e460356bafcf37e1489df89990c238bdec6d352dd3e9201433addaa48a157528671d1785946e8fd9e502d0d0bbdccb8ff3d6ecfb59334a5ce03
+  data.tar.gz: adaf5ac758ab00aabc58b3d7f2544db74c36649b90313b3a1732c5bb5a42986ba9f721d972e4bb9744c5c1f94e46c75dd5223dedd2ebfd64e89975dacb3b9825
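The digests in `checksums.yaml` can be checked locally against the two archives inside the `.gem` file. The following is a minimal sketch using only Ruby's standard `digest` and `yaml` libraries; the helper name `checksums_match?` and the sample bytes are hypothetical stand-ins, not part of the gem or of RubyGems.

```ruby
require "digest"
require "yaml"

# Compare a file's bytes against the SHA256 and SHA512 entries
# of a parsed checksums.yaml (a Hash of algorithm => { name => hex digest }).
def checksums_match?(checksums, name, bytes)
  checksums.fetch("SHA256").fetch(name) == Digest::SHA256.hexdigest(bytes) &&
    checksums.fetch("SHA512").fetch(name) == Digest::SHA512.hexdigest(bytes)
end

# Hypothetical stand-in for the real archive contents.
bytes = "example archive bytes"

# Build a checksums.yaml document with the same layout as the one in the diff.
checksums = YAML.safe_load(<<~YAML)
  ---
  SHA256:
    data.tar.gz: #{Digest::SHA256.hexdigest(bytes)}
  SHA512:
    data.tar.gz: #{Digest::SHA512.hexdigest(bytes)}
YAML

puts checksums_match?(checksums, "data.tar.gz", bytes)
```

For a real check, `bytes` would come from something like `File.binread("data.tar.gz")` after unpacking the `.gem` archive, and `checksums` from the unpacked `checksums.yaml`.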
data/CHANGELOG.md
CHANGED
@@ -1,5 +1,14 @@
 # Changelog
 
+## 0.10.6 (2026-06-11)
+
+Docs accuracy patch: correct the positioning of `llm_judge.md` against `ruby_llm-tribunal` after a deeper audit of Tribunal's documented scope. The previous wording ("you are reinventing what `ruby_llm-tribunal` ships as a built-in catalog") read as if Tribunal made `llm_judge.md` redundant — incorrect. Tribunal's README ships an off-the-shelf implementation catalog (`assert_faithful`, `assert_hallucination`, `assert_refusal`, `assert_no_pii`, etc.) but does **not** document calibration workflow, per-claim breakdown, judge-prompt iteration, or judge-as-`evaluator:` integration — exactly the methodology `llm_judge.md` covers. The two are complementary layers, not alternatives. No code behaviour change.
+
+### Fixed
+
+- **`docs/guide/relation_to_tribunal.md`** — added the **"Tribunal's catalog vs Contract's `llm_judge.md` — concrete decision tree"** sub-section under "When to use which", giving adopters a sharp three-way fork: reach for Tribunal's catalog when the check is domain-general (faithfulness vs context, hallucination, refusal, PII, jailbreak, toxicity, bias); build a custom judge per `llm_judge.md` when the criterion is domain-specific, when the judge needs to live inside a `define_eval` regression gate as the `evaluator:` lambda, when per-claim sentence-level debug output is required, or when the judge prompt itself needs to be iterated against your data; use both for the same project at different lifecycle stages (Tribunal at spec-time, calibrated custom judge at CI merge gate). Added the **"What Tribunal documents — and what it doesn't"** sub-section with a five-row comparison table making explicit which methodology gaps `llm_judge.md` covers that Tribunal's README leaves to the adopter (calibration against humans, prompt iteration on over-flag, per-claim breakdown, evaluator-lambda integration, the six anti-patterns).
+- **`docs/guide/llm_judge.md`** — rewrote the closing "When to escalate to Tribunal's catalog" section as **"When to reach for Tribunal instead"**: shorter, accurate (Tribunal is a complementary catalog, not a replacement), points to `relation_to_tribunal.md` for the full decision tree and integration patterns. The previous wording implied that building any of the four standard judges (faithful / hallucination / refusal / PII) was "reinvention" — true for the **implementation** (Tribunal ships them), false for the **methodology** (Tribunal's README doesn't document calibration, anti-patterns, or per-claim breakdown). The methodology applies equally to Tribunal's built-ins, Tribunal's custom registered judges, and Contract `Step::Base` judges.
+
 ## 0.10.5 (2026-06-11)
 
 Docs release: new `llm_judge.md` guide + comprehensive clarity audit across all 16 shipping documentation files. No code behaviour change.
data/docs/guide/llm_judge.md
CHANGED
@@ -195,9 +195,9 @@ What to do operationally:
 
 The drop is your signal to refine the judge's prompt, not to lower the gate.
 
-## When to
+## When to reach for Tribunal instead
 
-
+[`ruby_llm-tribunal`](https://github.com/Alqemist-labs/ruby_llm-tribunal) ships an off-the-shelf catalog of common LLM-as-judge assertions (`assert_faithful`, `assert_hallucination`, `assert_refusal`, `assert_no_pii`, etc.) — a shortcut when your check matches one of those domain-general categories. The methodology in this guide still applies: calibrate the judge against your human-labeled production data **before** trusting Tribunal's `default_threshold = 0.8`, refine the prompt when it over-flags, watch for the anti-patterns above. See [Relation to Tribunal](relation_to_tribunal.md) for the full positioning — what each gem documents (and doesn't), a concrete decision tree on catalog-vs-custom-judge, and three working integration patterns.
 
 ## See also
 
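The added paragraph advises calibrating the judge against human-labeled data before trusting a default threshold. One way to picture that step is the plain-Ruby sketch below. It assumes a judge that emits a score between 0.0 and 1.0 and a human pass/fail label per sample; the `Sample` struct, both helper names, and the data are hypothetical and belong to neither gem.

```ruby
# Each sample pairs a judge score (0.0..1.0) with a human pass/fail label.
Sample = Struct.new(:judge_score, :human_pass)

# Fraction of samples where "score >= threshold" agrees with the human label.
def agreement(samples, threshold)
  hits = samples.count { |s| (s.judge_score >= threshold) == s.human_pass }
  hits.fdiv(samples.size)
end

# Pick the candidate threshold with the highest agreement with the human labels.
def calibrate(samples, candidates)
  candidates.max_by { |t| agreement(samples, t) }
end

# Hypothetical labeled data; real calibration would use production samples.
samples = [
  Sample.new(0.95, true),
  Sample.new(0.85, true),
  Sample.new(0.75, true),
  Sample.new(0.72, true),
  Sample.new(0.65, false),
  Sample.new(0.40, false)
]

best = calibrate(samples, [0.6, 0.7, 0.8, 0.9])
puts "best threshold: #{best}"
```

On this made-up data a threshold of 0.8 would reject two responses that humans accepted, which is the kind of mismatch calibration is meant to surface before a gate is trusted.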
data/docs/guide/relation_to_tribunal.md
CHANGED
@@ -73,6 +73,34 @@ Tribunal grades **a fixed set of cases on every PR** to catch quality regression
 
 **Both.** You ship contracts in prod (Contract) AND want stronger CI signal beyond schema regression — judge-quality grading on a frozen dataset, plus adversarial red-team probes. Use Contract's `Step` to make the call, run it in `define_eval` over your dataset, and grade each case with Tribunal helpers in your spec or via the dataset's `evaluator:` proc.
 
+### Tribunal's catalog vs Contract's `llm_judge.md` — concrete decision tree
+
+If you specifically need an **LLM-as-judge** (a second LLM grading the first one's output), the decision is:
+
+- **Reach for Tribunal's catalog** when your check is one of the well-defined, domain-general categories Tribunal ships: *"is this faithful to the retrieved context?"*, *"is this a refusal?"*, *"does this contain PII?"*, *"is this jailbreak-resistant?"*, *"hallucinated?"*, *"toxic?"*, *"biased?"*. One line in a spec, default threshold, no judge code to write or maintain. The judge prompt is baked into the gem.
+- **Build a custom judge per [`llm_judge.md`](llm_judge.md)** when:
+  - **Your criterion is domain-specific** — *"does this medical advice match our internal safety policy?"*, *"is this reply in our brand voice?"*, *"does this summary preserve the legal disclaimer verbatim?"*. No off-the-shelf judge knows your policy; you write the prompt.
+  - **You need the verdict inside a `define_eval` regression gate** (the `evaluator:` lambda pattern) — Tribunal's surface is spec-time assertions, not eval-framework evaluators.
+  - **You need a per-claim breakdown** (sentence-level *"this claim → unsupported, that claim → contradicted"* output) for PR debugging — Tribunal returns one score per assertion.
+  - **You need to iterate the judge prompt** because it over-flags on your data — Tribunal's prompts are fixed per assertion.
+- **Use both** for the same project even when your check is in Tribunal's catalog: Tribunal's `assert_faithful` for spec-time grade on individual responses, plus a calibrated custom judge wired as `evaluator:` in a regression `define_eval` over a frozen dataset for CI merge-gating. They cover different lifecycle stages.
+
+Either way, the **methodology** in [`llm_judge.md`](llm_judge.md) — calibrate the judge against human-labeled production samples before trusting any score, watch for the six anti-patterns, refine the prompt when it over-flags — applies equally to Tribunal's built-ins, Tribunal's custom registered judges, and Contract `Step::Base` judges. Tribunal's `default_threshold = 0.8` is a starting point, not a calibrated bar for your data.
+
+## What Tribunal documents — and what it doesn't
+
+Tribunal's README ships an **implementation catalog** (`assert_faithful`, `assert_hallucination`, `assert_refusal`, `assert_no_pii`, `assert_no_toxicity`, `assert_no_bias`, `assert_jailbreak_resistant`, etc., plus a `register_judge` API for custom ones). What it currently leaves to the adopter:
+
+| Tribunal ships | Tribunal's README doesn't document (Contract's [`llm_judge.md`](llm_judge.md) does) |
+|---|---|
+| `default_threshold = 0.8` (fixed) | How to **calibrate** the threshold against your human-labeled production data |
+| Judge prompt baked in per assertion | How to **iterate the judge prompt** when it over-flags stylistic courtesy as drift |
+| Single score per assertion | **Per-claim breakdown** schema for sentence-level PR debugging |
+| `assert_faithful` in a spec | **Judge as `evaluator:` lambda** in a Contract `define_eval` regression gate |
+| Custom Judge mechanism (`register_judge`) | **Anti-patterns** (stubbing the verdict, calibrating on synthetic data, calibrating once and shipping) |
+
+This is a **complementary gap**, not a competition. Tribunal owns the implementation catalog; Contract's `llm_judge.md` owns the methodology. A typical production setup uses both layers: pick (or build) the implementation, then calibrate it against your humans **before** trusting any score.
+
 ## Integration patterns
 
 These work today without any code changes in either gem — both use plain Ruby blocks/procs as extension points.
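The "per-claim breakdown" and "judge as `evaluator:` lambda" rows in the hunk above can be pictured with a small plain-Ruby sketch. Everything here is an assumption made for illustration: the `ClaimVerdict` struct, the `summarize` helper, the sample claims, and the lambda's argument shape are hypothetical, and none of it is the documented signature of Contract's `define_eval` or of Tribunal.

```ruby
# One verdict per claim extracted from a response.
# verdict is one of :supported, :unsupported, :contradicted.
ClaimVerdict = Struct.new(:claim, :verdict)

# Collapse per-claim verdicts into a score while keeping the
# failing claims so a reviewer can see what went wrong.
def summarize(verdicts)
  failing = verdicts.reject { |v| v.verdict == :supported }
  score = verdicts.empty? ? 0.0 : (verdicts.size - failing.size).fdiv(verdicts.size)
  { score: score, failing: failing.map { |v| "#{v.claim} -> #{v.verdict}" } }
end

# Hypothetical gate: a lambda that turns a breakdown into pass/fail
# at a threshold that would come from calibration.
gate = ->(verdicts, threshold: 0.7) { summarize(verdicts)[:score] >= threshold }

# Hypothetical judge output for a single response.
verdicts = [
  ClaimVerdict.new("Refund window is 30 days", :supported),
  ClaimVerdict.new("Shipping is free", :unsupported),
  ClaimVerdict.new("Support is 24/7", :supported),
  ClaimVerdict.new("Warranty is 2 years", :supported)
]

puts summarize(verdicts)[:failing]
puts gate.call(verdicts)
```

Keeping the failing claims next to the score is what makes a red CI run debuggable: the reviewer sees which sentence was flagged, not only that the score dropped.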