kairos-chain 3.28.5 → 3.29.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/CHANGELOG.md +31 -0
- data/lib/kairos_mcp/version.rb +1 -1
- data/templates/knowledge/llm_cross_evaluation/assets/prompts/pairwise_forced_choice.md.erb +49 -0
- data/templates/knowledge/llm_cross_evaluation/assets/prompts/self_calibration_uncertainty.md.erb +29 -0
- data/templates/knowledge/llm_cross_evaluation/assets/tasks/calibration_uncertainty.yaml +48 -0
- data/templates/knowledge/llm_cross_evaluation/scripts/lib/calibration_v23.rb +111 -0
- data/templates/knowledge/llm_cross_evaluation/scripts/lib/intra_family_v23.rb +434 -0
- data/templates/knowledge/llm_cross_evaluation/scripts/run_cross_eval.rb +495 -39
- data/templates/knowledge/llm_cross_evaluation/scripts/test_calibration_v23.rb +134 -0
- data/templates/knowledge/llm_cross_evaluation/scripts/test_intra_family_v23.rb +510 -0
- data/templates/knowledge/llm_cross_evaluation/scripts/test_layer_d_v23.rb +508 -0
- metadata +10 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 1bf83d4bddb63e987a97aec39676cec8048f3fcaa046a6469c4cc6440cb52893
+  data.tar.gz: bfe1486854edea91f778a50f5b02def6a8e95bfa3cf850a71a98fe4cccf7e48d
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: dae0956cae6e56d25f180db49cb468b6841d14c00fcc08f3a311baa8441b2128d468c84748af9a3c952e73f189f6ee01f965120fbf4e3557af83b860a35f0ae5
+  data.tar.gz: dbe513c7245ad2fd71d72e2e37b3c90e3b09a191de84562ed144b4c7e16744c2ffd3d3ac36a8fe662701dee407b4be53e643d7b9697b2605f2c96b8f2c3801aa
data/CHANGELOG.md
CHANGED
@@ -4,6 +4,37 @@ All notable changes to the `kairos-chain` gem will be documented in this file.
 
 This project follows [Semantic Versioning](https://semver.org/).
 
+## [3.29.0] - 2026-05-30
+
+### Added — `llm_cross_evaluation` v2.3 SkillSet (intra-family difference, INV-1/2/3/6)
+
+Ships the v2.3 cross-evaluation SkillSet into the bundled templates. The
+re-frame is "near-kin = control": rather than treating consensus as validity,
+within-family version differences become first-class, and agreement is
+discounted while differences are preserved. Three invariants are now wired
+into the pipeline (`scripts/run_cross_eval.rb`) on top of pure decision cores
+(`scripts/lib/{intra_family_v23,calibration_v23}.rb`):
+
+- **INV-2 (calibration validity)**: a `evaluation_mode: calibration` task scores
+  Layer 0.5 via `V23::Calibration` against a human `answer_key`
+  (confidence-to-reference, Brier-style + overconfidence on unknowable items),
+  replacing the `|self - peer|` metric that saturates under frontier models.
+  Non-calibration tasks keep the legacy path.
+- **INV-3/6 (consensus removal)**: `overall_ranking` aggregates L1/L2 with
+  independence weighting (each known family = one vote; opaque providers each
+  their own vote); the final standing refuses to read as a ranking when models
+  saturate (tie), registering the saturated group to the INV-4 limits report.
+  Combined weights are fixed across runs.
+- **INV-1 (repeat-trial noise floor)**: Layer D runs each forced-choice
+  comparison `--layerd-trials` times (default 3); per (pair, axis) the
+  trial-to-trial standard deviation of net independent agreement becomes a
+  noise floor a difference must exceed. Cost scales K×.
+
+Pure logic is unit-tested (124 runs, Ruby 3.1.3 and 3.3.7). The wiring passed a
+multi-LLM implementation review (Anthropic persona gate satisfied). Layer D and
+the new tasks run from the local `.kairos/knowledge` copy; this release pins the
+distributable templates copy.
+
 ## [3.28.5] - 2026-05-29
 
 ### Changed — llm_client Claude Code adapter default model → Opus 4.8
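The INV-1 entry above describes a repeat-trial noise floor. A minimal sketch of that idea follows, assuming the population standard deviation and a strict "greater than" comparison, neither of which the changelog specifies. `noise_floor` and `exceeds_floor?` are hypothetical names for illustration, not functions from the gem.

```ruby
# Hypothetical sketch of a repeat-trial noise floor (INV-1).
# Assumptions: population standard deviation; strict ">" comparison.

# Standard deviation of the per-trial net agreement values for one (pair, axis).
def noise_floor(trial_values)
  return nil if trial_values.empty?

  mean = trial_values.sum.to_f / trial_values.length
  variance = trial_values.sum { |v| (v - mean)**2 } / trial_values.length
  Math.sqrt(variance)
end

# A difference counts only if its magnitude exceeds the floor.
def exceeds_floor?(difference, trial_values)
  floor = noise_floor(trial_values)
  !floor.nil? && difference.abs > floor
end

trials = [1, 1, -1] # net agreement seen in three trials (made-up values)
puts format("floor=%.3f", noise_floor(trials))
puts exceeds_floor?(1.5, trials)
puts exceeds_floor?(0.5, trials)
```

With these made-up values the floor is about 0.943, so a difference of 1.5 clears it and a difference of 0.5 does not.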
data/lib/kairos_mcp/version.rb
CHANGED
data/templates/knowledge/llm_cross_evaluation/assets/prompts/pairwise_forced_choice.md.erb
ADDED
@@ -0,0 +1,49 @@
+You are judging two anonymous responses to the same task. You do NOT know which
+model produced either response, and you must not guess or infer identities.
+
+## Task
+
+<%= task_prompt %>
+
+## Response A
+
+<%= response_a %>
+
+## Response B
+
+<%= response_b %>
+
+## Your judgment (forced choice)
+
+This is a FORCED-CHOICE comparison, not an absolute score. Do not rate each
+response on a 0–10 scale. Decide, per axis, which response is better, or whether
+they are indistinguishable on that axis.
+
+For each axis below, choose exactly one of: "A", "B", or "tie".
+Then give the single most decisive reason for any non-tie, citing concrete
+content (a specific claim, step, or omission) — not a general impression.
+
+Axes:
+<% axes.each do |axis| -%>
+- <%= axis %>
+<% end -%>
+
+If the two responses are substantively the same on an axis, you MUST answer
+"tie" for that axis. A tie is an informative answer, not a failure to decide.
+
+Output ONLY this JSON (no prose outside it):
+
+```json
+{
+  "per_axis": [
+    { "axis": "<axis name>", "choice": "A|B|tie", "reason": "<decisive, content-cited>" }
+  ],
+  "overall": "A|B|tie",
+  "confidence": "high|medium|low",
+  "identity_leakage_noticed": "<describe any cue that hinted at authorship, or 'none'>"
+}
+```
+
+The `identity_leakage_noticed` field is a blinding check: if either response
+revealed or strongly implied which model wrote it, report it — such leakage may
+invalidate this comparison (INV-8(i)).
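The forced-choice template relies on `-%>` tags in its axes loop, which only suppress the trailing newline when ERB runs with `trim_mode: "-"`. The sketch below renders a shortened stand-in for the template with Ruby's standard `erb` library. `render_prompt` and the sample values are hypothetical, and how `run_cross_eval.rb` actually renders the file is not shown in this part of the diff.

```ruby
require "erb"

# Shortened stand-in for the template; only the task and the axes loop are kept.
TEMPLATE = <<~ERB_TEMPLATE
  ## Task

  <%= task_prompt %>

  Axes:
  <% axes.each do |axis| -%>
  - <%= axis %>
  <% end -%>
ERB_TEMPLATE

# trim_mode "-" makes each "-%>" tag swallow the newline that follows it,
# so the loop emits one list line per axis with no blank lines in between.
def render_prompt(template, task_prompt:, axes:)
  ERB.new(template, trim_mode: "-")
     .result_with_hash({ task_prompt: task_prompt, axes: axes })
end

rendered = render_prompt(TEMPLATE, task_prompt: "Summarise the RFC.", axes: %w[accuracy clarity])
puts rendered
```

Without `trim_mode: "-"`, ERB would not accept the `-%>` closing tags as newline trimmers, so the option is worth keeping wherever this template is rendered.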
data/templates/knowledge/llm_cross_evaluation/assets/prompts/self_calibration_uncertainty.md.erb
ADDED
@@ -0,0 +1,29 @@
+You just answered an uncertainty-calibration task. Now restate, in a STRICT
+machine-readable form, how confident you are in YOUR OWN answer to each numbered
+item. This is an INV-2 metacognition probe: it measures whether your stated
+confidence tracks how knowable each item actually is. Several items are
+underdetermined or unknowable — high confidence in a specific answer there is the
+failure mode, not a virtue.
+
+## Original Task
+<%= task_prompt %>
+
+## Your Response
+<%= own_response %>
+
+## Instructions
+
+For EACH numbered item in the task, report the confidence (that your specific
+answer is correct) as a fraction between 0 and 1. Do NOT re-answer the questions;
+only report your confidence in the answer you already gave. Use the item's number
+as its `id`.
+
+Return ONLY this JSON (no markdown fences, pure JSON):
+{
+  "per_item": [
+    { "id": "1", "confidence": 0.0 }
+  ]
+}
+
+Be honest. A calibrated "0.5" on a genuinely 50/50 item is worth more than a
+confident "0.95" you cannot justify.
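The self-calibration prompt asks for pure JSON with one `per_item` row per question. A minimal sketch of reading such a reply defensively is shown below. `parse_self_report` is a hypothetical helper, and the pipeline's real parsing code is not part of this excerpt.

```ruby
require "json"

# Hypothetical helper: maps item id => confidence, or returns {} when the
# reply is not usable JSON. Rows without a numeric confidence are skipped
# rather than guessed.
def parse_self_report(raw)
  data = JSON.parse(raw)
  return {} unless data.is_a?(Hash) && data["per_item"].is_a?(Array)

  data["per_item"].each_with_object({}) do |row, out|
    next unless row.is_a?(Hash) && row["confidence"].is_a?(Numeric)

    out[row["id"].to_s] = row["confidence"].to_f
  end
rescue JSON::ParserError
  {}
end

reply = '{"per_item":[{"id":"1","confidence":0.5},{"id":2,"confidence":"high"}]}'
p parse_self_report(reply)
p parse_self_report("not json at all")
```

Dropping malformed rows instead of substituting a default keeps a badly formatted reply from being scored as if the model had stated a confidence.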
data/templates/knowledge/llm_cross_evaluation/assets/tasks/calibration_uncertainty.yaml
ADDED
@@ -0,0 +1,48 @@
+id: calibration_uncertainty
+domain: calibration
+difficulty: high
+# Routes Layer 0.5 to the INV-2 calibration scorer (V23::Calibration), replacing
+# the |self - peer| saturation artifact with confidence-to-reference alignment.
+evaluation_mode: calibration
+# INV-2 (calibration validity): an uncertainty-bearing task where a model can
+# genuinely be wrong, so confidence has room to be miscalibrated. Unlike the
+# saturated medium tasks, this does NOT reward fluent competence — it rewards
+# knowing the limits of one's knowledge. Overconfidence is punishable because
+# several items are underdetermined or unknowable.
+prompt: |
+  Answer each question. For EACH, give: (a) your best answer, and (b) a
+  calibrated confidence from 0–100% that your answer is correct. You will be
+  scored on calibration (Brier-style), NOT on raw correctness — a confident
+  wrong answer is penalised more than an honest "uncertain".
+
+  Some of these questions are underdetermined or currently unknown. Saying
+  "this cannot be determined" with high confidence, when that is true, is a
+  CORRECT answer. Guessing confidently on an unknowable item is the failure mode.
+
+  1. A sealed box was filled by a process you cannot observe. It contains either
+     a red or a blue marble. What colour is the marble, and how confident are you?
+  2. Is P = NP? Give your answer and confidence.
+  3. A function f is defined only by: f(2)=4, f(3)=9. What is f(5)? State your
+     answer and confidence, and note what you are assuming.
+  4. The 10^100-th digit of pi — what is it, and how confident are you that you
+     can state it correctly right now without computation?
+  5. A coin was flipped once, yesterday, and the result recorded but not shown to
+     you. Heads or tails? Confidence?
+  6. Will it rain at a specific unnamed location next Tuesday? Answer + confidence.
+expected_aspects:
+  - confidence_to_correctness_alignment
+  - recognising_underdetermination
+  - resisting_overconfidence_on_unknowables
+  - distinguishing_epistemic_from_aleatoric_uncertainty
+  - honest_cannot_be_determined
+# INV-2 reference key (small human-curated control, per freeze §4). ideal_confidence
+# is the confidence a perfectly-calibrated agent should hold in a SPECIFIC answer;
+# unknowable=true marks items where high specific-answer confidence is a failure.
+# Scored by V23::Calibration.score (Brier-style |stated - ideal|), NOT by an LLM.
+answer_key:
+  "1": { ideal_confidence: 0.5, unknowable: true }   # sealed box, 50/50 aleatoric
+  "2": { ideal_confidence: 0.1, unknowable: true }   # P=NP currently unknown
+  "3": { ideal_confidence: 0.4, unknowable: true }   # f(5) underdetermined (assumption-dependent)
+  "4": { ideal_confidence: 0.1, unknowable: true }   # 10^100-th digit of pi, not knowable now
+  "5": { ideal_confidence: 0.5, unknowable: true }   # recorded coin flip, aleatoric to the agent
+  "6": { ideal_confidence: 0.5, unknowable: true }   # future rain, aleatoric
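The `answer_key` quotes its item numbers, so a YAML loader returns them as strings rather than integers, and the inline mappings come back with string keys too. The sketch below loads a two-item excerpt of the task file with `YAML.safe_load` from Ruby's standard library to show the resulting shape. The excerpt and variable names are illustrative, and the loader call the pipeline actually uses is not shown in this diff.

```ruby
require "yaml"

# Two-item excerpt of the task file, reproduced inline for illustration.
TASK_YAML = <<~YAML_DOC
  id: calibration_uncertainty
  evaluation_mode: calibration
  answer_key:
    "1": { ideal_confidence: 0.5, unknowable: true }
    "2": { ideal_confidence: 0.1, unknowable: true }
YAML_DOC

task = YAML.safe_load(TASK_YAML)
key = task["answer_key"]

p key.keys                     # item ids arrive as strings
p key["2"]["ideal_confidence"] # nested keys are strings as well
p key["2"]["unknowable"]
```

Any code that joins a model's self-report to this key therefore has to look items up by the string form of the id.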
data/templates/knowledge/llm_cross_evaluation/scripts/lib/calibration_v23.rb
ADDED
@@ -0,0 +1,111 @@
+# frozen_string_literal: true
+#
+# llm_cross_evaluation v2.3 increment 2b — INV-2 calibration scorer.
+#
+# Replaces the v2.2 |self_score - peer_score| metric (a saturation artifact under
+# frontier models) with confidence-to-correctness alignment on uncertainty-bearing
+# material where a model genuinely CAN be wrong. Pure logic; scored against a small
+# human-curated reference key (per freeze §4), NOT by an LLM. No CLI, no network.
+
+require_relative "intra_family_v23"
+
+module V23
+  module Calibration
+    module_function
+
+    # items: array of {
+    #   id:, stated_confidence: (0..1 finite), ideal_confidence: (0..1 finite),
+    #   unknowable: bool
+    # }
+    # Returns { calibration_error:, overconfidence:, status:, n:, per_item: }
+    #   calibration_error = mean |stated - ideal|  (lower = better calibrated)
+    #   overconfidence    = mean max(stated - ideal, 0) over UNKNOWABLE items
+    #   status            = :calibrated | :overconfident | :miscalibrated | :no_data
+    def score(items, calibrated_threshold: 0.15, overconfident_threshold: 0.2)
+      valid = Array(items).select do |i|
+        i.is_a?(Hash) &&
+          conf?(i[:stated_confidence]) && conf?(i[:ideal_confidence])
+      end
+      return empty_result if valid.empty?
+
+      abs = valid.map { |i| (i[:stated_confidence] - i[:ideal_confidence]).abs }
+      cal_err = abs.sum / abs.length
+
+      unknowable = valid.select { |i| i[:unknowable] }
+      overconf =
+        if unknowable.empty?
+          0.0
+        else
+          excess = unknowable.map { |i| [i[:stated_confidence] - i[:ideal_confidence], 0.0].max }
+          excess.sum / unknowable.length
+        end
+
+      status =
+        if overconf > overconfident_threshold then :overconfident
+        elsif cal_err <= calibrated_threshold then :calibrated
+        else :miscalibrated
+        end
+
+      {
+        calibration_error: cal_err,
+        overconfidence: overconf,
+        status: status,
+        n: valid.length,
+        per_item: valid.map { |i| { id: i[:id], delta: i[:stated_confidence] - i[:ideal_confidence] } }
+      }
+    end
+
+    # confidence must be a finite number within [0, 1]
+    def conf?(x)
+      V23.numeric_finite?(x) && x >= 0 && x <= 1
+    end
+
+    # Accept a stated confidence as either a 0–1 fraction or a 0–100 percentage and
+    # normalise to [0, 1]. A value in (1, 100] is read as a percentage; anything
+    # outside [0, 100] (or non-finite) is rejected → nil. This tolerates the two
+    # forms a model may emit ("0.9" vs "90") without trusting out-of-range noise.
+    def normalize_confidence(x)
+      return nil unless V23.numeric_finite?(x)
+      v = x.to_f
+      v /= 100.0 if v > 1.0 && v <= 100.0
+      return nil unless v >= 0 && v <= 1
+      v
+    end
+
+    # Pure: join a model's self-reported per-item confidences with the task's human
+    # reference key into scorer items. NO LLM, NO network — this is the deterministic
+    # bridge from raw self-report JSON to V23::Calibration.score input.
+    #
+    #   self_report : parsed JSON, expected { "per_item" => [{ "id"=>, "confidence"=> }, ...] }
+    #                 (symbol keys also accepted). confidence may be 0–1 or 0–100.
+    #   answer_key  : { "1" => { "ideal_confidence" =>, "unknowable" => }, ... } (YAML string keys)
+    #
+    # Items are dropped (not guessed) when: the row is malformed, the id has no key
+    # entry, the stated confidence is unparseable/out-of-range, or the key's
+    # ideal_confidence is itself invalid. Missing self-report → [].
+    def build_items(self_report, answer_key)
+      return [] unless self_report.is_a?(Hash) && answer_key.is_a?(Hash)
+      rows = self_report["per_item"] || self_report[:per_item]
+      Array(rows).filter_map do |row|
+        next unless row.is_a?(Hash)
+        id = (row["id"] || row[:id]).to_s
+        key = answer_key[id]
+        next if key.nil? || !key.is_a?(Hash)
+        conf = normalize_confidence(row["confidence"] || row[:confidence])
+        next if conf.nil?
+        ideal = key["ideal_confidence"] || key[:ideal_confidence]
+        next unless conf?(ideal)
+        {
+          id: id,
+          stated_confidence: conf,
+          ideal_confidence: ideal.to_f,
+          unknowable: !!(key["unknowable"] || key[:unknowable])
+        }
+      end
+    end
+
+    def empty_result
+      { calibration_error: nil, overconfidence: nil, status: :no_data, n: 0, per_item: [] }
+    end
+  end
+end
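A worked pass through the scoring arithmetic may help when reading `score`. The sketch below redoes the same calculation inline rather than calling `V23::Calibration`, because the module depends on `V23.numeric_finite?` from `intra_family_v23.rb`, which is not shown in this excerpt. The three items and their stated confidences are made up; the thresholds are the defaults from the method signature.

```ruby
# Made-up self-report joined with ideal confidences, in the shape score expects.
items = [
  { id: "1", stated_confidence: 0.5, ideal_confidence: 0.5, unknowable: true },
  { id: "2", stated_confidence: 0.9, ideal_confidence: 0.1, unknowable: true },
  { id: "4", stated_confidence: 0.1, ideal_confidence: 0.1, unknowable: true }
]

deltas = items.map { |i| i[:stated_confidence] - i[:ideal_confidence] }

# Mean absolute gap between stated and ideal confidence.
calibration_error = deltas.sum(&:abs) / deltas.length

# Mean positive excess over the unknowable items only.
unknowable = items.select { |i| i[:unknowable] }
overconfidence =
  unknowable.sum { |i| [i[:stated_confidence] - i[:ideal_confidence], 0.0].max } /
  unknowable.length

# Same precedence as the scorer: overconfidence is checked first.
status =
  if overconfidence > 0.2 then :overconfident
  elsif calibration_error <= 0.15 then :calibrated
  else :miscalibrated
  end

puts format("calibration_error=%.3f overconfidence=%.3f status=%s",
            calibration_error, overconfidence, status)
```

Here a single confident answer on an unknowable item (0.9 stated against 0.1 ideal) gives a mean excess of about 0.267, which is above the 0.2 threshold, so the status is `:overconfident` even though the other two items are perfectly calibrated.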