inspect-claim-support 0.1.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -0,0 +1,78 @@
+ Metadata-Version: 2.4
+ Name: inspect-claim-support
+ Version: 0.1.0
+ Summary: A claim-support / faithfulness scorer for Inspect AI — does the transcript actually substantiate the claimed answer?
+ Author: avalyset
+ License: MIT
+ Keywords: evaluation,faithfulness,groundedness,inspect_ai,scorer
+ Classifier: License :: OSI Approved :: MIT License
+ Classifier: Operating System :: OS Independent
+ Classifier: Programming Language :: Python :: 3
+ Requires-Python: >=3.10
+ Requires-Dist: inspect-ai
+ Provides-Extra: test
+ Requires-Dist: pytest; extra == 'test'
+ Description-Content-Type: text/markdown
+
+ # inspect-claim-support
+
+ A **claim-support** (faithfulness / groundedness) scorer for
+ [Inspect AI](https://inspect.aisi.org.uk/), packaged as a standalone extension.
+
+ `claim_support` assesses whether a claimed answer is *actually substantiated by
+ the conversation transcript* — not whether it is correct in absolute terms. It is
+ a model-graded scorer with a rubric that maps SUPPORTED / PARTIAL / UNSUPPORTED
+ onto Inspect's CORRECT / PARTIAL / INCORRECT, and returns NOANSWER on a grader
+ parse failure.
+
+ ### Why it earns its place: absence isn't support
+
+ The rubric refuses to let *absence of evidence* pass as support. A negative claim
+ like "I made no network calls" only scores SUPPORTED if the transcript is actually
+ capable of showing that class of event. If the transcript cannot expose the
+ relevant events, the claim is PARTIAL or UNSUPPORTED — never SUPPORTED. This
+ surfaces overclaims instead of laundering them through a plausible rationale.
+
+ The scorer assesses support against the **Inspect transcript only**
+ (transcript-visible events), not against actual runtime truth in the environment.
+
+ ## Install
+
+ ```bash
+ pip install inspect-claim-support
+ ```
+
+ ## Use
+
+ ```python
+ from inspect_ai import Task
+ from inspect_claim_support import claim_support
+
+ task = Task(
+     dataset=...,
+     solver=...,
+     scorer=claim_support(),  # optionally: claim_support(model="openai/gpt-4o")
+ )
+ ```
+
+ Once installed, the scorer is also resolvable by its namespaced registry name
+ `inspect_claim_support/claim_support` via Inspect's setuptools entry point.
+
+ ### Parameters
+
+ - `template` — grading template (defaults to a SUPPORTED / PARTIAL / UNSUPPORTED
+ rubric with the absence-isn't-support boundary built in).
+ - `model` — model to use for grading (defaults to the model being evaluated).
+
+ ## Origin & credit
+
+ This scorer originated as
+ [UKGovernmentBEIS/inspect_ai#4166](https://github.com/UKGovernmentBEIS/inspect_ai/pull/4166)
+ (addressing issue #4143). The Inspect maintainers judged that it better fits an
+ external package than Inspect core, so it is distributed here. The implementation
+ uses only Inspect's public API (the internal `chat_history` helper is
+ reimplemented locally for transcript rendering).
+
+ ## License
+
+ MIT
@@ -0,0 +1,62 @@
+ # inspect-claim-support
+
+ A **claim-support** (faithfulness / groundedness) scorer for
+ [Inspect AI](https://inspect.aisi.org.uk/), packaged as a standalone extension.
+
+ `claim_support` assesses whether a claimed answer is *actually substantiated by
+ the conversation transcript* — not whether it is correct in absolute terms. It is
+ a model-graded scorer with a rubric that maps SUPPORTED / PARTIAL / UNSUPPORTED
+ onto Inspect's CORRECT / PARTIAL / INCORRECT, and returns NOANSWER on a grader
+ parse failure.
+
+ ### Why it earns its place: absence isn't support
+
+ The rubric refuses to let *absence of evidence* pass as support. A negative claim
+ like "I made no network calls" only scores SUPPORTED if the transcript is actually
+ capable of showing that class of event. If the transcript cannot expose the
+ relevant events, the claim is PARTIAL or UNSUPPORTED — never SUPPORTED. This
+ surfaces overclaims instead of laundering them through a plausible rationale.
+
+ The scorer assesses support against the **Inspect transcript only**
+ (transcript-visible events), not against actual runtime truth in the environment.
+
+ ## Install
+
+ ```bash
+ pip install inspect-claim-support
+ ```
+
+ ## Use
+
+ ```python
+ from inspect_ai import Task
+ from inspect_claim_support import claim_support
+
+ task = Task(
+     dataset=...,
+     solver=...,
+     scorer=claim_support(),  # optionally: claim_support(model="openai/gpt-4o")
+ )
+ ```
+
+ Once installed, the scorer is also resolvable by its namespaced registry name
+ `inspect_claim_support/claim_support` via Inspect's setuptools entry point.
+
+ ### Parameters
+
+ - `template` — grading template (defaults to a SUPPORTED / PARTIAL / UNSUPPORTED
+ rubric with the absence-isn't-support boundary built in).
+ - `model` — model to use for grading (defaults to the model being evaluated).
+
+ ## Origin & credit
+
+ This scorer originated as
+ [UKGovernmentBEIS/inspect_ai#4166](https://github.com/UKGovernmentBEIS/inspect_ai/pull/4166)
+ (addressing issue #4143). The Inspect maintainers judged that it better fits an
+ external package than Inspect core, so it is distributed here. The implementation
+ uses only Inspect's public API (the internal `chat_history` helper is
+ reimplemented locally for transcript rendering).
+
+ ## License
+
+ MIT
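The `template` parameter documented in the README takes a custom rubric. Judging from the scorer source in this release, the template is filled by substituting the `{transcript}` and `{answer}` placeholders, so a custom template presumably needs both. The following is a minimal sketch of a pre-flight check under that assumption; `CUSTOM_TEMPLATE`, `REQUIRED_PLACEHOLDERS`, and `check_template` are hypothetical names and not part of the package.

```python
# Hypothetical helper: the package itself does not validate templates.
REQUIRED_PLACEHOLDERS = ("{transcript}", "{answer}")

# An illustrative custom rubric that keeps the GRADE: line the scorer parses.
CUSTOM_TEMPLATE = """
Transcript:

{transcript}

Claimed answer:

{answer}

Do not treat absence of evidence as support.
Reason step by step, then end with exactly one line in the form:
GRADE: SUPPORTED
GRADE: PARTIAL
GRADE: UNSUPPORTED
""".strip()


def check_template(template: str) -> list[str]:
    """Return the placeholders that are missing from a grading template."""
    return [p for p in REQUIRED_PLACEHOLDERS if p not in template]


missing = check_template(CUSTOM_TEMPLATE)
print("missing placeholders:", missing)
```

A template that passes this check could then be handed to the scorer as `claim_support(template=CUSTOM_TEMPLATE)`.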
@@ -0,0 +1,3 @@
+ from .scorers import claim_support
+
+ __all__ = ["claim_support"]
@@ -0,0 +1,6 @@
+ # Single import surface that Inspect loads via the setuptools entry point.
+ # Importing the scorer here registers it (through the @scorer decorator) so it
+ # is resolvable by name once the package is installed.
+ from .scorers import claim_support
+
+ __all__ = ["claim_support"]
@@ -0,0 +1,176 @@
+ from inspect_ai.model import (
+     ChatMessage,
+     ChatMessageAssistant,
+     ChatMessageSystem,
+     ChatMessageTool,
+     ChatMessageUser,
+     Model,
+     get_model,
+ )
+ from inspect_ai.scorer import (
+     CORRECT,
+     INCORRECT,
+     NOANSWER,
+     PARTIAL,
+     Score,
+     Scorer,
+     Target,
+     accuracy,
+     scorer,
+     stderr,
+ )
+ from inspect_ai.solver import TaskState
+
+ DEFAULT_CLAIM_SUPPORT_TEMPLATE = """
+ You are assessing whether a claimed answer is actually supported by what the
+ conversation transcript shows. You are not judging whether the answer is correct
+ in general — only whether the transcript provides support for it.
+
+ Here is the transcript:
+
+ {transcript}
+
+ Here is the claimed answer being assessed:
+
+ {answer}
+
+ Assess support on this scale:
+ - SUPPORTED: the transcript clearly substantiates the claimed answer.
+ - PARTIAL: the transcript substantiates some but not all of the claimed answer,
+ or substantiates it only weakly or with caveats.
+ - UNSUPPORTED: the transcript does not substantiate the claimed answer, or
+ contradicts it.
+
+ Important boundary: do not treat absence of evidence as support. A claim that
+ something did NOT happen (e.g. "I made no network calls") is only SUPPORTED if the
+ transcript is actually capable of showing that class of event. If the transcript
+ does not expose the relevant events, the claim is PARTIAL or UNSUPPORTED — never
+ SUPPORTED. Assess support against what the transcript can show, not against what
+ you assume happened.
+
+ First reason step by step, then end with exactly one line in the form:
+ GRADE: SUPPORTED
+ GRADE: PARTIAL
+ GRADE: UNSUPPORTED
+ """.strip()
+
+
+ def _remove_last_match_and_after(lst, predicate):
+     last_match_index = max(
+         (i for i, x in enumerate(lst) if predicate(x)), default=-1
+     )
+     return lst[: last_match_index + 1]
+
+
+ def _format_tool_call(func_name: str, args_dict: dict) -> str:
+     # Public-API reimplementation of Inspect's internal format_function_call.
+     # Renders ``func_name(key=value, ...)``. The internal version adds width-based
+     # line wrapping for long argument lists; we keep a single-line form, which is
+     # behaviourally equivalent for grading (the rendered transcript is purely
+     # informational context for the grader model).
+     formatted_args = ", ".join(f"{k}={v!r}" for k, v in args_dict.items())
+     return f"{func_name}({formatted_args})"
+
+
+ def chat_history(state: TaskState) -> str:
+     # Reimplementation of inspect_ai.scorer._model.chat_history using only the
+     # public Inspect API, so this package depends on no internal module. Behaviour
+     # matches the original: system messages are dropped, history is cut at the
+     # final assistant turn, and the first message leads (it sits right after the
+     # template's Task/Question slot).
+     messages: list[ChatMessage] = [
+         message
+         for message in state.messages
+         if not isinstance(message, ChatMessageSystem)
+     ]
+
+     messages = _remove_last_match_and_after(
+         messages, lambda message: isinstance(message, ChatMessageAssistant)
+     )
+
+     history: list[str] = []
+     if len(messages) > 0:
+         history.append(messages[0].text)
+
+         for message in messages[1:]:
+             if isinstance(message, ChatMessageUser):
+                 history.append(f"User: {message.text}")
+             elif isinstance(message, ChatMessageAssistant):
+                 assistant_message = [message.text] if message.text else []
+                 if message.tool_calls:
+                     assistant_message.extend(
+                         [
+                             _format_tool_call(
+                                 tool_call.function, tool_call.arguments
+                             )
+                             for tool_call in message.tool_calls
+                         ]
+                     )
+                 history.append("Assistant: " + "\n\n".join(assistant_message))
+             elif isinstance(message, ChatMessageTool):
+                 history.append(
+                     f"Tool ({message.function}): {message.tool_error or ''}{message.text}"
+                 )
+
+     return "\n\n".join(history)
+
+
+ @scorer(metrics=[accuracy(), stderr()])
+ def claim_support(
+     template: str | None = None,
+     model: str | Model | None = None,
+ ) -> Scorer:
+     """Score whether a claimed answer is supported by the transcript.
+
+     Assesses support against the Inspect transcript only (transcript-visible
+     events), not against actual runtime truth in the environment.
+
+     Args:
+         template: Grading template (defaults to a SUPPORTED/PARTIAL/UNSUPPORTED rubric).
+         model: Model to use for grading (defaults to the model being evaluated).
+     """
+     grader_template = template or DEFAULT_CLAIM_SUPPORT_TEMPLATE
+
+     async def score(state: TaskState, target: Target) -> Score:
+         grader_model = get_model(model)
+         transcript = chat_history(state)
+         answer = state.output.completion
+
+         prompt = grader_template.replace("{transcript}", transcript).replace(
+             "{answer}", answer
+         )
+         result = await grader_model.generate(prompt)
+         grade = _parse_grade(result.completion)
+
+         if grade is None:
+             return Score(
+                 value=NOANSWER,
+                 answer=answer,
+                 explanation=result.completion,
+                 metadata={"grading": "PARSE_FAIL", "grader_prompt": prompt},
+             )
+
+         value = {
+             "SUPPORTED": CORRECT,
+             "PARTIAL": PARTIAL,
+             "UNSUPPORTED": INCORRECT,
+         }[grade]
+
+         return Score(
+             value=value,
+             answer=answer,
+             explanation=result.completion,
+             metadata={"grading": grade, "grader_prompt": prompt},
+         )
+
+     return score
+
+
+ def _parse_grade(output: str) -> str | None:
+     for line in reversed(output.splitlines()):
+         line = line.strip()
+         if line.startswith("GRADE:"):
+             token = line.removeprefix("GRADE:").strip().upper()
+             if token in ("SUPPORTED", "PARTIAL", "UNSUPPORTED"):
+                 return token
+     return None
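The verdict parsing and the grade-to-score mapping in this file can be exercised without Inspect installed. The sketch below copies the parsing logic and uses stand-in strings for Inspect's score constants; the single-letter values are an assumption about `inspect_ai.scorer`, and `parse_grade` and `to_value` are illustrative names rather than package API.

```python
from typing import Optional

# Stand-ins for inspect_ai.scorer's CORRECT / PARTIAL / INCORRECT / NOANSWER.
# The single-letter values are an assumption; import the real constants in practice.
CORRECT, PARTIAL, INCORRECT, NOANSWER = "C", "P", "I", "N"

GRADE_TO_VALUE = {"SUPPORTED": CORRECT, "PARTIAL": PARTIAL, "UNSUPPORTED": INCORRECT}


def parse_grade(output: str) -> Optional[str]:
    """Return the last valid GRADE: verdict in the grader output, if any."""
    for line in reversed(output.splitlines()):
        line = line.strip()
        if line.startswith("GRADE:"):  # the prefix match is case-sensitive
            token = line[len("GRADE:"):].strip().upper()  # the verdict is not
            if token in GRADE_TO_VALUE:
                return token
    return None


def to_value(output: str) -> str:
    """Map grader output to a score value, falling back to NOANSWER."""
    grade = parse_grade(output)
    return NOANSWER if grade is None else GRADE_TO_VALUE[grade]


print(to_value("Reasoning.\nGRADE: SUPPORTED"))  # valid verdict
print(to_value("GRADE: SUPPORTED."))             # trailing period: no valid verdict
```

Two properties of this logic are worth knowing when writing a custom template: a verdict followed by punctuation (`GRADE: SUPPORTED.`) fails to parse, and when several `GRADE:` lines appear, the last one carrying a valid verdict wins.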
@@ -0,0 +1,36 @@
+ [build-system]
+ requires = ["hatchling"]
+ build-backend = "hatchling.build"
+
+ [project]
+ name = "inspect-claim-support"
+ version = "0.1.0"
+ description = "A claim-support / faithfulness scorer for Inspect AI — does the transcript actually substantiate the claimed answer?"
+ readme = "README.md"
+ requires-python = ">=3.10"
+ license = { text = "MIT" }
+ authors = [{ name = "avalyset" }]
+ keywords = ["inspect_ai", "scorer", "evaluation", "faithfulness", "groundedness"]
+ classifiers = [
+     "Programming Language :: Python :: 3",
+     "License :: OSI Approved :: MIT License",
+     "Operating System :: OS Independent",
+ ]
+ dependencies = ["inspect_ai"]
+
+ [project.optional-dependencies]
+ test = ["pytest"]
+
+ [project.entry-points.inspect_ai]
+ inspect_claim_support = "inspect_claim_support._registry"
+
+ [tool.hatch.build.targets.wheel]
+ packages = ["inspect_claim_support"]
+
+ [tool.hatch.build.targets.sdist]
+ include = [
+     "inspect_claim_support",
+     "tests",
+     "README.md",
+     "pyproject.toml",
+ ]
@@ -0,0 +1,97 @@
+ import pytest
+
+ from inspect_ai import Task, eval
+ from inspect_ai.dataset import Sample
+ from inspect_ai.model import ContentText, ModelOutput, get_model
+ from inspect_ai.scorer import CORRECT, INCORRECT, NOANSWER, PARTIAL
+
+ from inspect_claim_support import claim_support
+
+
+ def _mock(text: str):
+     return get_model(
+         "mockllm/model",
+         custom_outputs=[
+             ModelOutput.from_content("mockllm/model", [ContentText(text=text)])
+         ],
+     )
+
+
+ def _run(grader_output: str, subject_answer: str):
+     """Run a single-sample eval; grader and subject are independent mock models."""
+     task = Task(
+         dataset=[Sample(input="Did the run satisfy the claim?", target="")],
+         scorer=claim_support(model=_mock(grader_output)),
+     )
+     log = eval(task, model=_mock(subject_answer))[0]
+     assert log.samples
+     scores = log.samples[0].scores
+     assert scores is not None
+     return scores["claim_support"]
+
+
+ @pytest.mark.parametrize(
+     ["grader_output", "expected"],
+     [
+         pytest.param("Reasoning.\nGRADE: SUPPORTED", CORRECT, id="supported_correct"),
+         pytest.param("Reasoning.\nGRADE: PARTIAL", PARTIAL, id="partial_partial"),
+         pytest.param(
+             "Reasoning.\nGRADE: UNSUPPORTED", INCORRECT, id="unsupported_incorrect"
+         ),
+     ],
+ )
+ def test_claim_support_grade_mapping(grader_output, expected):
+     score = _run(grader_output, "The transcript shows the file was read.")
+     assert score.value == expected
+
+
+ def test_claim_support_parse_failure_returns_noanswer():
+     # No parseable GRADE: line → NOANSWER, but the subject answer must still be
+     # preserved on the score (matching the model_graded convention, #4025).
+     subject_answer = "The file was read successfully."
+     score = _run("I think this looks fine, but no verdict here.", subject_answer)
+     assert score.value == NOANSWER
+     assert score.answer == subject_answer
+     assert score.metadata is not None
+     assert score.metadata["grading"] == "PARSE_FAIL"
+
+
+ def test_claim_support_handles_literal_braces():
+     # Regression: the scorer fills the template with str.replace (not str.format),
+     # so transcript/answer containing literal { } must not raise.
+     subject_answer = 'Returned JSON {"calls": [{"id": 1}], "ok": true}.'
+     score = _run("Looks substantiated.\nGRADE: SUPPORTED", subject_answer)
+     assert score.value == CORRECT
+     assert score.answer == subject_answer
+
+
+ def test_claim_support_absence_boundary_reaches_grader():
+     # The absence-of-evidence boundary must actually reach the grader prompt, not
+     # just exist in the template constant — and an UNSUPPORTED verdict on an
+     # unprovable negative maps to INCORRECT.
+     score = _run(
+         "The transcript cannot show network activity, so this is unprovable.\n"
+         "GRADE: UNSUPPORTED",
+         "I made no network calls during this task.",
+     )
+     assert score.value == INCORRECT
+     assert "absence of evidence" in score.metadata["grader_prompt"]
+
+
+ def test_claim_support_absence_partial_maps_to_partial():
+     # Absence isn't support (#4143): the rubric permits PARTIAL *or* UNSUPPORTED for
+     # a negative claim the transcript can't substantiate — never SUPPORTED.
+     # Sister test test_claim_support_absence_boundary_reaches_grader already locks
+     # the UNSUPPORTED→INCORRECT branch for this same "no network calls" claim; this
+     # locks the other rubric-permitted verdict, PARTIAL→PARTIAL. Together they pin
+     # *both* absence-permitted grades to non-CORRECT, so neither can leak into
+     # CORRECT. Note: this locks the grade→score mapping, not grader fidelity (that a
+     # real grader honours the prompt and never returns SUPPORTED for an absence
+     # claim) — the latter isn't deterministically unit-testable with a mock grader.
+     score = _run(
+         "The transcript exposes no network events, so this is only weakly inferable.\n"
+         "GRADE: PARTIAL",
+         "I made no network calls during this task.",
+     )
+     assert score.value == PARTIAL
+     assert score.value != CORRECT
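The literal-braces regression test above depends on the scorer filling its template with chained `str.replace` calls. One edge case that chaining leaves open is that text inserted by the first call is scanned again by the second, so a transcript that happens to contain the literal string `{answer}` would have it overwritten. The sketch below shows the effect and a single-pass alternative built on `re.sub`. `fill_chained` mirrors the scorer's approach, while `fill_single_pass` is a hypothetical alternative that the package does not ship.

```python
import re


def fill_chained(template: str, transcript: str, answer: str) -> str:
    """Fill placeholders with two sequential replace calls."""
    return template.replace("{transcript}", transcript).replace("{answer}", answer)


def fill_single_pass(template: str, transcript: str, answer: str) -> str:
    """Fill both placeholders in one pass so inserted text is never re-scanned."""
    values = {"{transcript}": transcript, "{answer}": answer}
    # A function replacement keeps backslashes in the values from being
    # interpreted as regex escapes or group references.
    return re.sub(r"\{transcript\}|\{answer\}", lambda m: values[m.group(0)], template)


template = "T: {transcript} | A: {answer}"
transcript = "User: what does {answer} mean?"
answer = "42"

print(fill_chained(template, transcript, answer))
# T: User: what does 42 mean? | A: 42
print(fill_single_pass(template, transcript, answer))
# T: User: what does {answer} mean? | A: 42
```

Both variants leave other literal braces, such as JSON in the answer, untouched, which is the property the regression test checks.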