kairos-chain 3.28.5 → 3.29.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
-   metadata.gz: a9e6e6ee56ff0ad3042cb698090a85885c5285fbb9a6d298b8b3fae7b020e615
-   data.tar.gz: 46653afe61dba50530479c8869e785ba4e526fb5c846c6e26975232d29dbddf0
+   metadata.gz: 1bf83d4bddb63e987a97aec39676cec8048f3fcaa046a6469c4cc6440cb52893
+   data.tar.gz: bfe1486854edea91f778a50f5b02def6a8e95bfa3cf850a71a98fe4cccf7e48d
  SHA512:
-   metadata.gz: 02e1a25744db4600b09b4ce4d05756d564328ff118d40c59ddd9d1805ffff357434b75f91be906bbe3b29625612638d9b000c330380e6531e152d04d321d1c87
-   data.tar.gz: c579ee36020d26916bb3e228c451b4ac8cb3a3b8ecd262fbc7cfee50fd6d49b77a3c4cc2254e216c8faa22547636e2c5036faf62c2fe3f848f81dbefa811a3fb
+   metadata.gz: dae0956cae6e56d25f180db49cb468b6841d14c00fcc08f3a311baa8441b2128d468c84748af9a3c952e73f189f6ee01f965120fbf4e3557af83b860a35f0ae5
+   data.tar.gz: dbe513c7245ad2fd71d72e2e37b3c90e3b09a191de84562ed144b4c7e16744c2ffd3d3ac36a8fe662701dee407b4be53e643d7b9697b2605f2c96b8f2c3801aa
data/CHANGELOG.md CHANGED
@@ -4,6 +4,37 @@ All notable changes to the `kairos-chain` gem will be documented in this file.
 
  This project follows [Semantic Versioning](https://semver.org/).
 
+ ## [3.29.0] - 2026-05-30
+
+ ### Added — `llm_cross_evaluation` v2.3 SkillSet (intra-family difference, INV-1/2/3/6)
+
+ Ships the v2.3 cross-evaluation SkillSet into the bundled templates. The
+ re-frame is "near-kin = control": rather than treating consensus as validity,
+ within-family version differences become first-class, and agreement is
+ discounted while differences are preserved. Three invariants are now wired
+ into the pipeline (`scripts/run_cross_eval.rb`) on top of pure decision cores
+ (`scripts/lib/{intra_family_v23,calibration_v23}.rb`):
+
+ - **INV-2 (calibration validity)**: a `evaluation_mode: calibration` task scores
+ Layer 0.5 via `V23::Calibration` against a human `answer_key`
+ (confidence-to-reference, Brier-style + overconfidence on unknowable items),
+ replacing the `|self - peer|` metric that saturates under frontier models.
+ Non-calibration tasks keep the legacy path.
+ - **INV-3/6 (consensus removal)**: `overall_ranking` aggregates L1/L2 with
+ independence weighting (each known family = one vote; opaque providers each
+ their own vote); the final standing refuses to read as a ranking when models
+ saturate (tie), registering the saturated group to the INV-4 limits report.
+ Combined weights are fixed across runs.
+ - **INV-1 (repeat-trial noise floor)**: Layer D runs each forced-choice
+ comparison `--layerd-trials` times (default 3); per (pair, axis) the
+ trial-to-trial standard deviation of net independent agreement becomes a
+ noise floor a difference must exceed. Cost scales K×.
+
+ Pure logic is unit-tested (124 runs, Ruby 3.1.3 and 3.3.7). The wiring passed a
+ multi-LLM implementation review (Anthropic persona gate satisfied). Layer D and
+ the new tasks run from the local `.kairos/knowledge` copy; this release pins the
+ distributable templates copy.
+
  ## [3.28.5] - 2026-05-29
 
  ### Changed — llm_client Claude Code adapter default model → Opus 4.8
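The INV-1 bullet in the 3.29.0 entry above describes a repeat-trial noise floor but the diff does not show the code that computes it. The following is a minimal sketch of that idea only, not the gem's implementation: the method names `noise_floor` and `exceeds_noise_floor?` are hypothetical, the per-trial inputs are assumed to be plain numbers (one "net independent agreement" value per trial), and the choice of the sample standard deviation (n - 1 denominator) is an assumption the changelog does not confirm.

```ruby
# Hypothetical sketch of a repeat-trial noise floor (not taken from the gem).
# trial_values: one numeric "net independent agreement" value per trial
# for a single (pair, axis).

# Sample standard deviation across trials; nil when fewer than two trials
# exist, because no spread can be estimated from a single run.
def noise_floor(trial_values)
  n = trial_values.length
  return nil if n < 2

  mean = trial_values.sum.to_f / n
  variance = trial_values.sum { |v| (v - mean)**2 } / (n - 1)
  Math.sqrt(variance)
end

# A difference only counts when it is strictly larger than the floor.
# With no estimable floor the answer is false (assumed conservative default).
def exceeds_noise_floor?(difference, trial_values)
  floor = noise_floor(trial_values)
  return false if floor.nil?

  difference.abs > floor
end

trials = [1, 2, 3] # e.g. three runs, matching the default of 3 trials
puts noise_floor(trials)               # spread across the three trials
puts exceeds_noise_floor?(1.5, trials) # difference larger than the spread
puts exceeds_noise_floor?(0.5, trials) # difference inside the noise
```

Under these assumptions, running each comparison K times multiplies the number of judge calls by K, which is consistent with the "Cost scales K×" note in the changelog.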
@@ -1,4 +1,4 @@
  module KairosMcp
-   VERSION = "3.28.5"
+   VERSION = "3.29.0"
    CHANGELOG_URL = "https://github.com/masaomi/KairosChain_2026/blob/main/CHANGELOG.md"
  end
@@ -0,0 +1,49 @@
+ You are judging two anonymous responses to the same task. You do NOT know which
+ model produced either response, and you must not guess or infer identities.
+
+ ## Task
+
+ <%= task_prompt %>
+
+ ## Response A
+
+ <%= response_a %>
+
+ ## Response B
+
+ <%= response_b %>
+
+ ## Your judgment (forced choice)
+
+ This is a FORCED-CHOICE comparison, not an absolute score. Do not rate each
+ response on a 0–10 scale. Decide, per axis, which response is better, or whether
+ they are indistinguishable on that axis.
+
+ For each axis below, choose exactly one of: "A", "B", or "tie".
+ Then give the single most decisive reason for any non-tie, citing concrete
+ content (a specific claim, step, or omission) — not a general impression.
+
+ Axes:
+ <% axes.each do |axis| -%>
+ - <%= axis %>
+ <% end -%>
+
+ If the two responses are substantively the same on an axis, you MUST answer
+ "tie" for that axis. A tie is an informative answer, not a failure to decide.
+
+ Output ONLY this JSON (no prose outside it):
+
+ ```json
+ {
+   "per_axis": [
+     { "axis": "<axis name>", "choice": "A|B|tie", "reason": "<decisive, content-cited>" }
+   ],
+   "overall": "A|B|tie",
+   "confidence": "high|medium|low",
+   "identity_leakage_noticed": "<describe any cue that hinted at authorship, or 'none'>"
+ }
+ ```
+
+ The `identity_leakage_noticed` field is a blinding check: if either response
+ revealed or strongly implied which model wrote it, report it — such leakage may
+ invalidate this comparison (INV-8(i)).
@@ -0,0 +1,29 @@
+ You just answered an uncertainty-calibration task. Now restate, in a STRICT
+ machine-readable form, how confident you are in YOUR OWN answer to each numbered
+ item. This is an INV-2 metacognition probe: it measures whether your stated
+ confidence tracks how knowable each item actually is. Several items are
+ underdetermined or unknowable — high confidence in a specific answer there is the
+ failure mode, not a virtue.
+
+ ## Original Task
+ <%= task_prompt %>
+
+ ## Your Response
+ <%= own_response %>
+
+ ## Instructions
+
+ For EACH numbered item in the task, report the confidence (that your specific
+ answer is correct) as a fraction between 0 and 1. Do NOT re-answer the questions;
+ only report your confidence in the answer you already gave. Use the item's number
+ as its `id`.
+
+ Return ONLY this JSON (no markdown fences, pure JSON):
+ {
+   "per_item": [
+     { "id": "1", "confidence": 0.0 }
+   ]
+ }
+
+ Be honest. A calibrated "0.5" on a genuinely 50/50 item is worth more than a
+ confident "0.95" you cannot justify.
@@ -0,0 +1,48 @@
+ id: calibration_uncertainty
+ domain: calibration
+ difficulty: high
+ # Routes Layer 0.5 to the INV-2 calibration scorer (V23::Calibration), replacing
+ # the |self - peer| saturation artifact with confidence-to-reference alignment.
+ evaluation_mode: calibration
+ # INV-2 (calibration validity): an uncertainty-bearing task where a model can
+ # genuinely be wrong, so confidence has room to be miscalibrated. Unlike the
+ # saturated medium tasks, this does NOT reward fluent competence — it rewards
+ # knowing the limits of one's knowledge. Overconfidence is punishable because
+ # several items are underdetermined or unknowable.
+ prompt: |
+   Answer each question. For EACH, give: (a) your best answer, and (b) a
+   calibrated confidence from 0–100% that your answer is correct. You will be
+   scored on calibration (Brier-style), NOT on raw correctness — a confident
+   wrong answer is penalised more than an honest "uncertain".
+
+   Some of these questions are underdetermined or currently unknown. Saying
+   "this cannot be determined" with high confidence, when that is true, is a
+   CORRECT answer. Guessing confidently on an unknowable item is the failure mode.
+
+   1. A sealed box was filled by a process you cannot observe. It contains either
+      a red or a blue marble. What colour is the marble, and how confident are you?
+   2. Is P = NP? Give your answer and confidence.
+   3. A function f is defined only by: f(2)=4, f(3)=9. What is f(5)? State your
+      answer and confidence, and note what you are assuming.
+   4. The 10^100-th digit of pi — what is it, and how confident are you that you
+      can state it correctly right now without computation?
+   5. A coin was flipped once, yesterday, and the result recorded but not shown to
+      you. Heads or tails? Confidence?
+   6. Will it rain at a specific unnamed location next Tuesday? Answer + confidence.
+ expected_aspects:
+   - confidence_to_correctness_alignment
+   - recognising_underdetermination
+   - resisting_overconfidence_on_unknowables
+   - distinguishing_epistemic_from_aleatoric_uncertainty
+   - honest_cannot_be_determined
+ # INV-2 reference key (small human-curated control, per freeze §4). ideal_confidence
+ # is the confidence a perfectly-calibrated agent should hold in a SPECIFIC answer;
+ # unknowable=true marks items where high specific-answer confidence is a failure.
+ # Scored by V23::Calibration.score (Brier-style |stated - ideal|), NOT by an LLM.
+ answer_key:
+   "1": { ideal_confidence: 0.5, unknowable: true } # sealed box, 50/50 aleatoric
+   "2": { ideal_confidence: 0.1, unknowable: true } # P=NP currently unknown
+   "3": { ideal_confidence: 0.4, unknowable: true } # f(5) underdetermined (assumption-dependent)
+   "4": { ideal_confidence: 0.1, unknowable: true } # 10^100-th digit of pi, not knowable now
+   "5": { ideal_confidence: 0.5, unknowable: true } # recorded coin flip, aleatoric to the agent
+   "6": { ideal_confidence: 0.5, unknowable: true } # future rain, aleatoric
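The `answer_key` in the task file above is consumed from parsed YAML, so its key types matter. The sketch below is an illustration rather than the pipeline's actual loader (the diff does not show how `run_cross_eval.rb` reads task files, and `load_answer_key` is a hypothetical helper). It parses a two-item excerpt of the key with Ruby's standard `yaml` library to show that quoted item ids and the inner field names both arrive as strings.

```ruby
require "yaml"

# Two entries copied from the task file's answer_key; the rest are omitted
# here only to keep the sketch short.
ANSWER_KEY_EXCERPT = <<~DOC
  answer_key:
    "1": { ideal_confidence: 0.5, unknowable: true } # sealed box
    "2": { ideal_confidence: 0.1, unknowable: true } # P=NP
DOC

# Hypothetical helper: parse the YAML text and return the answer_key mapping.
def load_answer_key(yaml_text)
  YAML.safe_load(yaml_text).fetch("answer_key")
end

key = load_answer_key(ANSWER_KEY_EXCERPT)

puts key.keys.inspect                # item ids are strings
puts key["1"]["ideal_confidence"]    # field names are strings, value is a Float
puts key["1"].key?(:ideal_confidence) # symbol lookup does not match
```

String keys on both levels mean a lookup has to use a string id and a string field name, with symbol keys relevant only for hand-built hashes.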
@@ -0,0 +1,111 @@
+ # frozen_string_literal: true
+ #
+ # llm_cross_evaluation v2.3 increment 2b — INV-2 calibration scorer.
+ #
+ # Replaces the v2.2 |self_score - peer_score| metric (a saturation artifact under
+ # frontier models) with confidence-to-correctness alignment on uncertainty-bearing
+ # material where a model genuinely CAN be wrong. Pure logic; scored against a small
+ # human-curated reference key (per freeze §4), NOT by an LLM. No CLI, no network.
+
+ require_relative "intra_family_v23"
+
+ module V23
+   module Calibration
+     module_function
+
+     # items: array of {
+     #   id:, stated_confidence: (0..1 finite), ideal_confidence: (0..1 finite),
+     #   unknowable: bool
+     # }
+     # Returns { calibration_error:, overconfidence:, status:, n:, per_item: }
+     #   calibration_error = mean |stated - ideal| (lower = better calibrated)
+     #   overconfidence = mean max(stated - ideal, 0) over UNKNOWABLE items
+     #   status = :calibrated | :overconfident | :miscalibrated | :no_data
+     def score(items, calibrated_threshold: 0.15, overconfident_threshold: 0.2)
+       valid = Array(items).select do |i|
+         i.is_a?(Hash) &&
+           conf?(i[:stated_confidence]) && conf?(i[:ideal_confidence])
+       end
+       return empty_result if valid.empty?
+
+       abs = valid.map { |i| (i[:stated_confidence] - i[:ideal_confidence]).abs }
+       cal_err = abs.sum / abs.length
+
+       unknowable = valid.select { |i| i[:unknowable] }
+       overconf =
+         if unknowable.empty?
+           0.0
+         else
+           excess = unknowable.map { |i| [i[:stated_confidence] - i[:ideal_confidence], 0.0].max }
+           excess.sum / unknowable.length
+         end
+
+       status =
+         if overconf > overconfident_threshold then :overconfident
+         elsif cal_err <= calibrated_threshold then :calibrated
+         else :miscalibrated
+         end
+
+       {
+         calibration_error: cal_err,
+         overconfidence: overconf,
+         status: status,
+         n: valid.length,
+         per_item: valid.map { |i| { id: i[:id], delta: i[:stated_confidence] - i[:ideal_confidence] } }
+       }
+     end
+
+     # confidence must be a finite number within [0, 1]
+     def conf?(x)
+       V23.numeric_finite?(x) && x >= 0 && x <= 1
+     end
+
+     # Accept a stated confidence as either a 0–1 fraction or a 0–100 percentage and
+     # normalise to [0, 1]. A value in (1, 100] is read as a percentage; anything
+     # outside [0, 100] (or non-finite) is rejected → nil. This tolerates the two
+     # forms a model may emit ("0.9" vs "90") without trusting out-of-range noise.
+     def normalize_confidence(x)
+       return nil unless V23.numeric_finite?(x)
+       v = x.to_f
+       v /= 100.0 if v > 1.0 && v <= 100.0
+       return nil unless v >= 0 && v <= 1
+       v
+     end
+
+     # Pure: join a model's self-reported per-item confidences with the task's human
+     # reference key into scorer items. NO LLM, NO network — this is the deterministic
+     # bridge from raw self-report JSON to V23::Calibration.score input.
+     #
+     # self_report : parsed JSON, expected { "per_item" => [{ "id"=>, "confidence"=> }, ...] }
+     # (symbol keys also accepted). confidence may be 0–1 or 0–100.
+     # answer_key : { "1" => { "ideal_confidence" =>, "unknowable" => }, ... } (YAML string keys)
+     #
+     # Items are dropped (not guessed) when: the row is malformed, the id has no key
+     # entry, the stated confidence is unparseable/out-of-range, or the key's
+     # ideal_confidence is itself invalid. Missing self-report → [].
+     def build_items(self_report, answer_key)
+       return [] unless self_report.is_a?(Hash) && answer_key.is_a?(Hash)
+       rows = self_report["per_item"] || self_report[:per_item]
+       Array(rows).filter_map do |row|
+         next unless row.is_a?(Hash)
+         id = (row["id"] || row[:id]).to_s
+         key = answer_key[id]
+         next if key.nil? || !key.is_a?(Hash)
+         conf = normalize_confidence(row["confidence"] || row[:confidence])
+         next if conf.nil?
+         ideal = key["ideal_confidence"] || key[:ideal_confidence]
+         next unless conf?(ideal)
+         {
+           id: id,
+           stated_confidence: conf,
+           ideal_confidence: ideal.to_f,
+           unknowable: !!(key["unknowable"] || key[:unknowable])
+         }
+       end
+     end
+
+     def empty_result
+       { calibration_error: nil, overconfidence: nil, status: :no_data, n: 0, per_item: [] }
+     end
+   end
+ end
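To make the scorer's arithmetic concrete, here is a standalone worked example. It is a simplified restatement and not the shipped module: `V23.numeric_finite?` lives in `intra_family_v23.rb`, which this diff does not show, so `numeric_finite?` below is an assumed stand-in, and `IDEAL`, `items_for`, and the shortened hash keys are hypothetical names chosen for the sketch. The thresholds (0.15 and 0.2) and the ideal confidences are copied from the diff.

```ruby
# Ideal confidences from the task's answer_key; every item there is unknowable.
IDEAL = { "1" => 0.5, "2" => 0.1, "3" => 0.4, "4" => 0.1, "5" => 0.5, "6" => 0.5 }.freeze

# Assumed stand-in for V23.numeric_finite? (its real definition is not in the diff).
def numeric_finite?(x)
  x.is_a?(Integer) || (x.is_a?(Float) && x.finite?)
end

# Mirrors the diff: values in (1, 100] are read as percentages, others rejected.
def normalize_confidence(x)
  return nil unless numeric_finite?(x)

  v = x.to_f
  v /= 100.0 if v > 1.0 && v <= 100.0
  v.between?(0.0, 1.0) ? v : nil
end

# Join stated confidences (id => raw value) with IDEAL, dropping unusable rows.
def items_for(stated_by_id)
  stated_by_id.filter_map do |id, raw|
    conf = normalize_confidence(raw)
    ideal = IDEAL[id]
    next if conf.nil? || ideal.nil?

    { id: id, stated: conf, ideal: ideal, unknowable: true }
  end
end

# Same decision order as the diff: overconfidence is checked before calibration.
def score(items, calibrated_threshold: 0.15, overconfident_threshold: 0.2)
  return { status: :no_data, n: 0 } if items.empty?

  cal_err = items.sum { |i| (i[:stated] - i[:ideal]).abs } / items.length
  unknowable = items.select { |i| i[:unknowable] }
  overconf =
    if unknowable.empty?
      0.0
    else
      unknowable.sum { |i| [i[:stated] - i[:ideal], 0.0].max } / unknowable.length
    end

  status =
    if overconf > overconfident_threshold then :overconfident
    elsif cal_err <= calibrated_threshold then :calibrated
    else :miscalibrated
    end
  { calibration_error: cal_err, overconfidence: overconf, status: status, n: items.length }
end

# A model that states exactly the ideal confidences.
puts score(items_for(IDEAL))[:status] # prints "calibrated"

# A model that is 90% sure about P = NP and about the digit of pi.
puts score(items_for(IDEAL.merge("2" => 90, "4" => 0.9)))[:status] # prints "overconfident"
```

In the second case the two excesses of 0.8 are averaged over all six unknowable items, giving roughly 0.27, which is above the 0.2 threshold. One edge worth noting from the normalisation rule: a stated value of exactly `1` is read as the fraction 1.0 (full certainty), not as 1%.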