fieldkit 0.4.2__tar.gz → 0.4.3__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (55)
  1. {fieldkit-0.4.2 → fieldkit-0.4.3}/.gitignore +2 -1
  2. {fieldkit-0.4.2 → fieldkit-0.4.3}/CHANGELOG.md +34 -0
  3. {fieldkit-0.4.2 → fieldkit-0.4.3}/PKG-INFO +1 -1
  4. {fieldkit-0.4.2 → fieldkit-0.4.3}/docs/api/eval.md +54 -0
  5. {fieldkit-0.4.2 → fieldkit-0.4.3}/pyproject.toml +1 -0
  6. {fieldkit-0.4.2 → fieldkit-0.4.3}/src/fieldkit/_version.py +1 -1
  7. {fieldkit-0.4.2 → fieldkit-0.4.3}/src/fieldkit/eval/__init__.py +419 -0
  8. fieldkit-0.4.3/src/fieldkit/eval/rubrics/office_action_argument.md +51 -0
  9. fieldkit-0.4.3/src/fieldkit/eval/rubrics/patent_claim_validity.md +53 -0
  10. {fieldkit-0.4.2 → fieldkit-0.4.3}/src/fieldkit/eval/vertical.py +105 -3
  11. fieldkit-0.4.3/tests/eval/__init__.py +0 -0
  12. fieldkit-0.4.3/tests/eval/test_irac_structure.py +169 -0
  13. fieldkit-0.4.3/tests/eval/test_judge_backed_scorers.py +233 -0
  14. fieldkit-0.4.3/tests/eval/test_mcq_letter.py +111 -0
  15. fieldkit-0.4.3/tests/eval/test_prior_art_relevance.py +137 -0
  16. {fieldkit-0.4.2 → fieldkit-0.4.3}/tests/test_vertical_bench.py +318 -0
  17. {fieldkit-0.4.2 → fieldkit-0.4.3}/LICENSE +0 -0
  18. {fieldkit-0.4.2 → fieldkit-0.4.3}/README.md +0 -0
  19. {fieldkit-0.4.2 → fieldkit-0.4.3}/docs/api/capabilities.md +0 -0
  20. {fieldkit-0.4.2 → fieldkit-0.4.3}/docs/api/cli.md +0 -0
  21. {fieldkit-0.4.2 → fieldkit-0.4.3}/docs/api/lineage.md +0 -0
  22. {fieldkit-0.4.2 → fieldkit-0.4.3}/docs/api/nim.md +0 -0
  23. {fieldkit-0.4.2 → fieldkit-0.4.3}/docs/api/publish.md +0 -0
  24. {fieldkit-0.4.2 → fieldkit-0.4.3}/docs/api/quant.md +0 -0
  25. {fieldkit-0.4.2 → fieldkit-0.4.3}/docs/api/rag.md +0 -0
  26. {fieldkit-0.4.2 → fieldkit-0.4.3}/docs/api/training.md +0 -0
  27. {fieldkit-0.4.2 → fieldkit-0.4.3}/samples/bench-rag.py +0 -0
  28. {fieldkit-0.4.2 → fieldkit-0.4.3}/samples/feasibility-math.py +0 -0
  29. {fieldkit-0.4.2 → fieldkit-0.4.3}/samples/hello-lineage.py +0 -0
  30. {fieldkit-0.4.2 → fieldkit-0.4.3}/samples/hello-nim.py +0 -0
  31. {fieldkit-0.4.2 → fieldkit-0.4.3}/samples/naive-rag.py +0 -0
  32. {fieldkit-0.4.2 → fieldkit-0.4.3}/src/fieldkit/__init__.py +0 -0
  33. {fieldkit-0.4.2 → fieldkit-0.4.3}/src/fieldkit/capabilities/__init__.py +0 -0
  34. {fieldkit-0.4.2 → fieldkit-0.4.3}/src/fieldkit/capabilities/data/__init__.py +0 -0
  35. {fieldkit-0.4.2 → fieldkit-0.4.3}/src/fieldkit/capabilities/data/spark-capabilities.json +0 -0
  36. {fieldkit-0.4.2 → fieldkit-0.4.3}/src/fieldkit/cli/__init__.py +0 -0
  37. {fieldkit-0.4.2 → fieldkit-0.4.3}/src/fieldkit/lineage/__init__.py +0 -0
  38. {fieldkit-0.4.2 → fieldkit-0.4.3}/src/fieldkit/nim/__init__.py +0 -0
  39. {fieldkit-0.4.2 → fieldkit-0.4.3}/src/fieldkit/publish/__init__.py +0 -0
  40. {fieldkit-0.4.2 → fieldkit-0.4.3}/src/fieldkit/quant/__init__.py +0 -0
  41. {fieldkit-0.4.2 → fieldkit-0.4.3}/src/fieldkit/rag/__init__.py +0 -0
  42. {fieldkit-0.4.2 → fieldkit-0.4.3}/src/fieldkit/training/__init__.py +0 -0
  43. {fieldkit-0.4.2 → fieldkit-0.4.3}/tests/__init__.py +0 -0
  44. {fieldkit-0.4.2 → fieldkit-0.4.3}/tests/conftest.py +0 -0
  45. {fieldkit-0.4.2 → fieldkit-0.4.3}/tests/test_capabilities.py +0 -0
  46. {fieldkit-0.4.2 → fieldkit-0.4.3}/tests/test_cli.py +0 -0
  47. {fieldkit-0.4.2 → fieldkit-0.4.3}/tests/test_eval.py +0 -0
  48. {fieldkit-0.4.2 → fieldkit-0.4.3}/tests/test_lineage.py +0 -0
  49. {fieldkit-0.4.2 → fieldkit-0.4.3}/tests/test_nim.py +0 -0
  50. {fieldkit-0.4.2 → fieldkit-0.4.3}/tests/test_nim_spark.py +0 -0
  51. {fieldkit-0.4.2 → fieldkit-0.4.3}/tests/test_publish.py +0 -0
  52. {fieldkit-0.4.2 → fieldkit-0.4.3}/tests/test_quant.py +0 -0
  53. {fieldkit-0.4.2 → fieldkit-0.4.3}/tests/test_rag.py +0 -0
  54. {fieldkit-0.4.2 → fieldkit-0.4.3}/tests/test_rag_spark.py +0 -0
  55. {fieldkit-0.4.2 → fieldkit-0.4.3}/tests/test_training.py +0 -0
@@ -25,7 +25,8 @@ pnpm-debug.log*
25
25
  # local-only working material (not for the public blog)
26
26
  ideas/
27
27
  HANDOFF.md
28
- .claude/
28
+ .claude/*
29
+ !.claude/skills/
29
30
 
30
31
  # transient vibe-test artifacts (Playwright screenshots written to repo root)
31
32
  .playwright-mcp/
@@ -6,6 +6,40 @@ The format follows [Keep a Changelog](https://keepachangelog.com/en/1.1.0/), and
6
6
 
7
7
  ## [Unreleased]
8
8
 
9
+ ## [0.4.3] — 2026-05-17
10
+
11
+ ### Added — `fieldkit.eval` patent-strategist scorer build-out (T6)
12
+
13
+ Four new scorers in `fieldkit.eval` round out the `format='patent-strategist'` branch landed in v0.4.2 (T4) and the `mcq_letter` promotion (T5), per `specs/patent-strategist-v1.md` §3.3:
14
+
15
+ - **`patent_claim_validity(predicted, expected, *, judge, rubric=None)`** — PatentScore-methodology 7-dim claim-validity scorer (novelty / non-obviousness / written-description / enablement / indefiniteness / subject-matter-eligibility / dependent-claim-structure). LLM-judge backed; caller supplies a `Judge(client=..., rubric=RUBRIC_PATENT_CLAIM_VALIDITY)`. Per-row `rubric` dict (e.g. `cited_prior_art`, `claim_type`) is rendered into a sorted, deterministic `Hints:` block fed to the judge as context. PatentScore methodology only — no data reuse from the cited paper (license unclear).
16
+ - **`office_action_argument(predicted, expected, *, judge, rubric=None)`** — 4-dim office-action-response scorer (rejection-type identification, statutory citation accuracy, argument structure, persuasiveness). Same `Judge`-wrapping shape; per-row hints like `rejection_type`, `required_citations`, `claim_count`, `relies_on_official_notice` flow through the `Hints:` block.
17
+ - **`irac_structure(predicted, expected="")`** — deterministic 4-checklist scorer for Patent-Bar-style IRAC responses. One regex per component (Issue / Rule / Application / Conclusion); returns `{0.0, 0.25, 0.5, 0.75, 1.0}` based on how many fire. Tolerant patterns — markdown headings, all-caps section labels, transition prose ("Whether…", "Under 35 USC 103…", "Here…", "Therefore…") all count. False positives are far less harmful than false negatives at quarter-granularity. The only T6 scorer that needs no network, so it's the one wired end-to-end through `VerticalBench` in the integration test.
18
+ - **`prior_art_relevance(predicted, expected) -> float`** — Spearman ρ on ranked prior-art lists, returning just the rho per spec §3.3. Tolerant parser accepts JSON arrays (`'["a","b","c"]'`), comma-separated, or newline-separated (with `1.`, `1)`, `- `, `* ` prefixes stripped) as well as `list[str]` directly. Missing-from-pred gold items get worst-rank padding so omissions still penalize. The paired-rank vectors are re-rankified before correlation so positional gaps from dup-skipping or padding collapse to contiguous ranks — without this, `["a","a","b","c"]` vs `["a","b","c"]` would yield ρ≈0.98 instead of the intuitive 1.0. **`prior_art_relevance_full`** returns the same rho plus an `mse_likert` field (populated only when both sides parse as numeric Likert vectors) and `n`, packaged as the frozen `PriorArtRelevanceResult` dataclass.
19
+
20
+ ### Added — rubric markdown bundled in the wheel
21
+
22
+ - **`fieldkit/src/fieldkit/eval/rubrics/{patent_claim_validity,office_action_argument}.md`**: system-prompt markdown shipped alongside the module. Read via the new **`load_rubric(name)`** helper, which also populates the **`RUBRIC_PATENT_CLAIM_VALIDITY`** / **`RUBRIC_OFFICE_ACTION_ARGUMENT`** module constants at import time for the common case. `[tool.hatch.build.targets.wheel].include` extended with `src/fieldkit/eval/rubrics/*.md` so the markdown lands in the wheel.
23
+
24
+ ### Added — `fieldkit.eval.vertical` live-callable dispatch
25
+
26
+ - **`PATENT_STRATEGIST_SCORER_FNS: dict[str, Callable[..., float]]`** — companion to the existing string-keyed `PATENT_STRATEGIST_SCORERS` map. Resolves the four T6 scorers + the promoted `mcq_letter` to live functions (skips the two `judge_rubric` slots ("C", "E") which are open-ended `Judge.grade(...)` calls without a single named scorer fn). Drift-detection test asserts every fn's `__name__` matches the matching string-map entry.
27
+
28
+ ### Test suite
29
+
30
+ **+93 new tests** across three new test files + the existing vertical-bench test class:
31
+
32
+ - `tests/eval/test_irac_structure.py` — perfect / partial / per-component-detector coverage; quarter-granularity parametrize; whitespace-only / empty / expected-arg-ignored edges.
33
+ - `tests/eval/test_prior_art_relevance.py` — perfect / reversed / partial-overlap; string-parsing variants (JSON, comma, newline-numbered, bullet, paren-numbered); Likert MSE branch (perfect, off-by-one, length-mismatch fallback, non-numeric); dataclass shape (frozen, three fields); the known-value `n=4` swap (ρ=0.8) plus the dup-skip test that drove the `_rankify`-on-paired-vectors fix.
34
+ - `tests/eval/test_judge_backed_scorers.py` — `load_rubric` round-trip + missing-file error; `_format_rubric_hints` (empty / scalar / list-bullet / sorted-determinism / nested-dict JSON); both judge-backed scorers wired against a `_FakeJudge` fixture (no network) covering happy path, `None`-score fallback to `0.0`, rubric→`Hints:` threading, empty-reference collapse to `None`; signature-introspection tests ensuring `judge` and `rubric` stay keyword-only so `VerticalBench.scorer_kwargs` plumbing works.
35
+ - `tests/test_vertical_bench.py::TestPatentStrategistFormat` — 3 new tests: `PATENT_STRATEGIST_SCORER_FNS` resolves each key to the expected callable; name-map vs fn-map drift assertion; full end-to-end `VerticalBench.run` exercising `irac_structure` over a 2-row JSONL with one perfect and one half-formed IRAC response (mean accuracy = 0.75).
36
+
37
+ Total suite: **507 passed, 2 skipped** offline (`pytest -q`, `/tmp/fk` venv). The 2 skips are the long-standing `--spark`-gated live-NIM / pgvector integration tests.
38
+
39
+ ### Articles in this release
40
+
41
+ - `articles/becoming-a-patent-strategist-on-spark/` — patent-strategist v1.0 article (W3 publish target per spec §1 deliverables). T6's scorer build-out is the load-bearing dependency for the article's bench-comparison numbers; v0.4.3 is the version the article will pin against.
42
+
9
43
  ## [0.4.2] — 2026-05-15
10
44
 
11
45
  Patch release. Two card-rendering polish lifts on `fieldkit.publish` driven by the 2026-05-15 cyber-vertical cycle (`Orionfold/SecurityLLM-GGUF`, the third vertical card on this surface — zero fieldkit source changes between Saul / cyber, the v0.4.1 publishing surface generalized exactly as designed). Both lifts are additive (one new `ModelCard` field already shipped on `main` in `ff1b92f`; one new `ArtifactManifest` field added here). No new modules, no new public classes, no breaking changes — purely a tightening pass.
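The quarter-granularity arithmetic behind the 0.4.3 entry's "mean accuracy = 0.75" end-to-end test can be sketched as follows. The four patterns are copied from the `_IRAC_PATTERNS` hunk of `src/fieldkit/eval/__init__.py` in this diff; the two sample responses and the name `irac_score` are made up for illustration and are not the rows used by the package's test suite.

```python
import re

# Patterns copied from the `_IRAC_PATTERNS` hunk in this diff.
IRAC_PATTERNS = {
    "issue": re.compile(
        r"\b(?:issue\s*[:\s]|the\s+issue\b|question\s+presented\b|whether\b)",
        re.IGNORECASE,
    ),
    "rule": re.compile(
        r"(?:\brule\b|\bthe\s+rule\b|"
        r"\bunder\s+(?:35\s+u\.?s\.?c\.?|the\s+(?:statute|law|holding|standard|mpep))|"
        r"\bmpep\s+(?:§|sec(?:tion)?\.?)?\s*\d|"
        r"\b35\s+u\.?s\.?c\.?\s*(?:§|sec(?:tion)?\.?)?\s*\d{2,3}|"
        r"\blaw\s+(?:provides|states|requires)\b|"
        r"\bholding\b)",
        re.IGNORECASE,
    ),
    "application": re.compile(
        r"(?:\bapplication\b|\bapplying\b|\bhere[,\s]|\bin\s+this\s+case\b|"
        r"\bin\s+the\s+(?:present|instant)\b|"
        r"\bthe\s+(?:present|instant)\s+(?:claim|application|invention)\b|"
        r"\bas\s+applied\b|\bapplied\s+to\b)",
        re.IGNORECASE,
    ),
    "conclusion": re.compile(
        r"(?:\bconclusion\b|\bin\s+conclusion\b|\btherefore\b|\baccordingly\b|"
        r"\bthus[,\s]|"
        r"\bfor\s+(?:the\s+)?(?:above|foregoing)\s+reasons\b|"
        r"\bin\s+sum\b|"
        r"\bthe\s+(?:examiner|applicant)\s+(?:should|is\s+respectfully\s+requested)\b)",
        re.IGNORECASE,
    ),
}


def irac_score(text: str) -> float:
    """One quarter per IRAC component whose pattern fires (hypothetical name)."""
    if not text:
        return 0.0
    hits = sum(1 for pat in IRAC_PATTERNS.values() if pat.search(text))
    return hits / 4.0


# Made-up sample responses: one with all four components, one with only
# an Issue lead-in ("Whether") and a Conclusion lead-in ("Therefore").
perfect = (
    "Whether claim 1 is obvious. Under 35 USC 103, a claim is unpatentable "
    "if the differences are obvious. Here, the cited reference teaches every "
    "element. Therefore, the rejection should be sustained."
)
half_formed = "Whether claim 1 is obvious. Therefore, the claim fails."

scores = [irac_score(perfect), irac_score(half_formed)]
mean_accuracy = sum(scores) / len(scores)
print(scores, mean_accuracy)
```

With one response scoring 4/4 and one scoring 2/4, the mean over the two rows is 0.75, which is the shape of result the changelog describes for the `VerticalBench.run` integration test.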
@@ -1,6 +1,6 @@
1
1
  Metadata-Version: 2.4
2
2
  Name: fieldkit
3
- Version: 0.4.2
3
+ Version: 0.4.3
4
4
  Summary: Verified-on-Spark patterns lifted from the ai-field-notes blog into one importable Python package.
5
5
  Project-URL: Homepage, https://ainative.business/fieldkit/
6
6
  Project-URL: Source, https://github.com/manavsehgal/ai-field-notes/tree/main/fieldkit
@@ -54,6 +54,14 @@ from fieldkit.eval import (
54
54
  # v0.4.x — vertical-curator surface
55
55
  VerticalBench, VerticalQA,
56
56
  contains, exact_match, numeric_match,
57
+
58
+ # v0.4.3 — patent-strategist scorers
59
+ mcq_letter,
60
+ irac_structure,
61
+ prior_art_relevance, prior_art_relevance_full, PriorArtRelevanceResult,
62
+ patent_claim_validity, office_action_argument,
63
+ RUBRIC_PATENT_CLAIM_VALIDITY, RUBRIC_OFFICE_ACTION_ARGUMENT,
64
+ load_rubric,
57
65
  )
58
66
  ```
59
67
 
@@ -292,6 +300,52 @@ numeric_match("Revenue was $4.55B", "4.5B",
292
300
  | `contains(p, e)` | The model is asked to answer in prose and the reference is a key fact/number/phrase that must appear somewhere in the answer. |
293
301
  | `numeric_match(p, e, *, rel_tolerance=0.01)` | FinanceBench-style quantitative answers. Extracts the first number from each side (commas stripped), compares under relative tolerance. Defaults to ±1% per FinanceBench's grading convention. Returns 0.0 if either side has no parseable number — including refusals, so the refusal counter elsewhere doesn't need to gate this scorer. |
294
302
 
303
+ ### Patent-strategist scorers *(v0.4.3)*
304
+
305
+ Five scorers + two rubric constants land in v0.4.3 to round out the `format='patent-strategist'` branch of `VerticalBench`. Wire them through `VerticalBench(scorer=…, scorer_kwargs=…)` or import the live-callable dispatch map at `fieldkit.eval.vertical.PATENT_STRATEGIST_SCORER_FNS`. The 1-paragraph-per-scorer cheat sheet:
306
+
307
+ #### `mcq_letter(predicted, expected, *, strip_think=True) -> float`
308
+
309
+ MCQ letter scorer promoted from `scripts/g3_*.py` after three vertical-bench reuses (cybermetric, medmcqa, patent-strategist). Decision order: stripped one-letter (`"B"`), then `"answer: X"` / `"answer is X"` / `"option X"` / `"choice X"`, then first word-bounded `[A-D]`. Case-insensitive throughout. When `strip_think=True` (default), `<think>...</think>` blocks are regex-stripped *before* the three-step decision, which keeps reasoning-trace verbosity on R1-distill family models from polluting the letter pick. The substitution is a no-op on cyber/medical text without `<think>` tags, so existing callers can leave the default on safely.
310
+
311
+ #### `irac_structure(predicted, expected="") -> float`
312
+
313
+ Deterministic 4-checklist Patent-Bar IRAC detector. Returns one of `{0.0, 0.25, 0.5, 0.75, 1.0}` based on Issue / Rule / Application / Conclusion regex hits. Tolerant patterns: markdown headings, all-caps section labels, transition prose (`"Whether…"`, `"Under 35 USC 103…"`, `"Here…"`, `"Therefore…"`) all count. `expected` is ignored — the scorer measures structural form, not factual agreement; kept in the signature for `VerticalBench` compatibility. False positives are far less harmful than false negatives at this granularity; the score's job is to flag *structural absence*, not grade rhetorical polish.
314
+
315
+ #### `prior_art_relevance(predicted, expected) -> float`
316
+
317
+ Spearman ρ between predicted and gold prior-art rankings — the bench-facing scalar per `specs/patent-strategist-v1.md` §3.3. Accepts `list[str]` directly or a tolerant string parse (JSON arrays `'["a","b","c"]'`, comma-separated `"a, b, c"`, or newline-separated with `1.` / `1)` / `- ` / `* ` prefixes stripped). Items missing from `predicted` get worst-rank padding so omissions still penalize. The paired-rank vectors get re-rankified before correlation so positional gaps from dup-skipping or padding collapse to contiguous ranks — without this, `["a","a","b","c"]` vs `["a","b","c"]` would yield ρ≈0.98 instead of 1.0.
318
+
319
+ #### `prior_art_relevance_full(predicted, expected) -> PriorArtRelevanceResult`
320
+
321
+ Returns the same ρ plus an `mse_likert` field (populated only when both sides parse as numeric Likert vectors, e.g. `"5,4,3,2,1"`) and an `n` count, packaged as a frozen `PriorArtRelevanceResult(spearman_rho, mse_likert, n)` dataclass. The bench surface uses `prior_art_relevance` because the scorer contract is `Callable[..., float]`; this full variant is for callers that want both metrics in a single pass.
322
+
323
+ #### `patent_claim_validity(predicted, expected, *, judge, rubric=None) -> float`
324
+
325
+ PatentScore-methodology 7-dim claim-validity scorer (novelty / non-obviousness / written-description / enablement / indefiniteness / subject-matter-eligibility / dependent-claim-structure). LLM-judge backed; caller supplies a `Judge` instance constructed with `rubric=RUBRIC_PATENT_CLAIM_VALIDITY`. Per-row `rubric` dict (convention keys: `cited_prior_art`, `claim_type`, `dependency_target`, `statutory_focus`) renders into a deterministic sorted `Hints:` block fed to the judge as context. Returns the parsed score, mapping `None` → `0.0` so bench accuracy-averaging stays well-defined. **PatentScore methodology only — no data reuse from the cited paper** (license unclear).
326
+
327
+ ```python
328
+ from fieldkit.eval import Judge, RUBRIC_PATENT_CLAIM_VALIDITY, patent_claim_validity
329
+ from fieldkit.nim import NIMClient
330
+
331
+ with NIMClient(base_url="http://localhost:8000/v1", model="...") as c:
332
+ judge = Judge(client=c, rubric=RUBRIC_PATENT_CLAIM_VALIDITY)
333
+ score = patent_claim_validity(
334
+ predicted_claim_text,
335
+ reference_claim_text,
336
+ judge=judge,
337
+ rubric={"cited_prior_art": ["US10987654", "US20210123456"]},
338
+ )
339
+ ```
340
+
341
+ #### `office_action_argument(predicted, expected, *, judge, rubric=None) -> float`
342
+
343
+ 4-dim office-action-response scorer (rejection-type identification, statutory citation accuracy, argument structure, persuasiveness). Same `Judge`-wrapping shape as `patent_claim_validity`; pair with `RUBRIC_OFFICE_ACTION_ARGUMENT`. Convention rubric keys: `rejection_type` (`102` / `103` / `112(a)` / `112(b)` / `101` / `double-patenting` / `restriction`), `required_citations` (list of expected MPEP/CFR/case cites), `claim_count`, `relies_on_official_notice`.
344
+
345
+ #### Rubric loader: `load_rubric(name) -> str`
346
+
347
+ The `RUBRIC_PATENT_CLAIM_VALIDITY` and `RUBRIC_OFFICE_ACTION_ARGUMENT` module constants are populated at import time from markdown files shipped under `fieldkit/eval/rubrics/`. Call `load_rubric("patent_claim_validity")` to re-read the file (or `load_rubric("my_rubric")` to read your own `my_rubric.md` if you ship a fork). The `[tool.hatch.build.targets.wheel].include` glob ships `*.md` under that subtree, so the rubrics travel with the wheel.
348
+
295
349
  ## Samples
296
350
 
297
351
  - [`samples/bench-rag.py`](https://github.com/manavsehgal/ai-field-notes/blob/main/fieldkit/samples/bench-rag.py) — offline `Bench` + `Judge.parse` walkthrough.
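The dup-skip figure quoted in the `prior_art_relevance` section of this hunk (ρ≈0.98 versus 1.0 for `["a","a","b","c"]` against `["a","b","c"]`) can be reproduced with a small standalone sketch. The correlation function mirrors `_spearman_rho` from the `__init__.py` hunk in this diff; `rankify` and `paired_ranks` are hypothetical stand-ins, because the body of the package's `_rankify` helper is not shown here.

```python
import math


def spearman_rho(ranks_a: list[float], ranks_b: list[float]) -> float:
    """Pearson correlation on rank vectors, mirroring `_spearman_rho` in the diff."""
    n = len(ranks_a)
    if n < 2 or len(ranks_b) != n:
        return 0.0
    mean_a = sum(ranks_a) / n
    mean_b = sum(ranks_b) / n
    cov = sum((ranks_a[i] - mean_a) * (ranks_b[i] - mean_b) for i in range(n))
    var_a = sum((r - mean_a) ** 2 for r in ranks_a)
    var_b = sum((r - mean_b) ** 2 for r in ranks_b)
    denom = math.sqrt(var_a * var_b)
    return 0.0 if denom == 0.0 else cov / denom


def rankify(values: list[float]) -> list[float]:
    """Hypothetical stand-in for `_rankify`: contiguous 1-based ranks, ties averaged."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def paired_ranks(pred: list[str], gold: list[str]) -> tuple[list[float], list[float]]:
    """Position of each gold item in `pred` (first occurrence), worst rank if absent."""
    first_pos: dict[str, int] = {}
    for i, item in enumerate(pred):
        first_pos.setdefault(item, i + 1)
    worst = len(gold) + 1
    pred_ranks = [float(first_pos.get(item, worst)) for item in gold]
    gold_ranks = [float(i + 1) for i in range(len(gold))]
    return pred_ranks, gold_ranks


pred_ranks, gold_ranks = paired_ranks(["a", "a", "b", "c"], ["a", "b", "c"])
rho_raw = spearman_rho(pred_ranks, gold_ranks)                        # positions 1, 3, 4
rho_ranked = spearman_rho(rankify(pred_ranks), rankify(gold_ranks))  # ranks 1, 2, 3
print(round(rho_raw, 3), rho_ranked)
```

The duplicate `"a"` pushes `"b"` and `"c"` to positions 3 and 4, so the raw positions `[1, 3, 4]` are monotone but not evenly spaced against `[1, 2, 3]`. Re-ranking collapses the gap, and the correlation becomes exactly 1.0.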
@@ -59,6 +59,7 @@ packages = ["src/fieldkit"]
59
59
  include = [
60
60
  "src/fieldkit/**/*.py",
61
61
  "src/fieldkit/**/data/*.json",
62
+ "src/fieldkit/eval/rubrics/*.md",
62
63
  ]
63
64
 
64
65
  [tool.hatch.build.targets.sdist]
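One way to confirm that the new `include` glob actually puts the rubric markdown into a built wheel is to list the archive members, since a wheel is a zip file. This is a minimal sketch: the function name `bundled_rubrics` is hypothetical, and the in-wheel prefix `fieldkit/eval/rubrics/` is an assumption based on the `packages = ["src/fieldkit"]` setting and the path the docs quote.

```python
import zipfile
from fnmatch import fnmatch


def bundled_rubrics(wheel_path: str, pattern: str = "fieldkit/eval/rubrics/*.md") -> list[str]:
    """Return the sorted archive members of a wheel that match `pattern`.

    The default pattern is an assumption about where the rubrics land
    inside the wheel; adjust it if the build lays files out differently.
    """
    with zipfile.ZipFile(wheel_path) as wheel:
        return sorted(name for name in wheel.namelist() if fnmatch(name, pattern))


# Example call (the wheel filename below is hypothetical):
# bundled_rubrics("dist/fieldkit-0.4.3-py3-none-any.whl")
```

An empty list from a freshly built wheel would suggest the glob did not match, which is the failure mode the `pyproject.toml` change is meant to prevent.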
@@ -6,4 +6,4 @@
6
6
  build time, so bumping it here is enough to bump the wheel version too.
7
7
  """
8
8
 
9
- __version__ = "0.4.2"
9
+ __version__ = "0.4.3"
@@ -45,6 +45,8 @@ __all__ = [
45
45
  "REFUSAL_PATTERNS",
46
46
  "RUBRIC_CORRECTNESS",
47
47
  "RUBRIC_FAITHFULNESS",
48
+ "RUBRIC_OFFICE_ACTION_ARGUMENT",
49
+ "RUBRIC_PATENT_CLAIM_VALIDITY",
48
50
  "RUBRIC_RELEVANCE",
49
51
  "AgentRun",
50
52
  "AssertionGrader",
@@ -60,6 +62,7 @@ __all__ = [
60
62
  "MatchedBaseComparisonResult",
61
63
  "PassAtK",
62
64
  "PassAtKResult",
65
+ "PriorArtRelevanceResult",
63
66
  "Trajectory",
64
67
  "TrajectoryIter",
65
68
  "TurnDetail",
@@ -67,9 +70,16 @@ __all__ = [
67
70
  "VerticalQA",
68
71
  "contains",
69
72
  "exact_match",
73
+ "irac_structure",
70
74
  "is_refusal",
75
+ "load_rubric",
76
+ "mcq_letter",
71
77
  "numeric_match",
78
+ "office_action_argument",
72
79
  "pass_at_k_estimator",
80
+ "patent_claim_validity",
81
+ "prior_art_relevance",
82
+ "prior_art_relevance_full",
73
83
  "summarize_agent_runs",
74
84
  "summarize_metric",
75
85
  ]
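For the `mcq_letter` name added to `__all__` above, the effect of `strip_think` can be illustrated with a condensed standalone sketch. The regexes and control flow follow the `mcq_letter` implementation in the next hunk of this diff; the name `mcq_letter_sketch` and the sample response are made up for illustration.

```python
import re

# Regexes copied from the `mcq_letter` hunk in this diff.
AFTER_ANSWER_RE = re.compile(
    r"\b(?:answer|choice|option)\b[^A-Za-z0-9]{0,20}([A-D])\b",
    re.IGNORECASE,
)
BOUNDED_RE = re.compile(r"\b([A-D])\b", re.IGNORECASE)
THINK_BLOCK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)


def mcq_letter_sketch(predicted: str, expected: str, *, strip_think: bool = True) -> float:
    """Condensed copy of the scorer's decision order (hypothetical name)."""
    pred = predicted or ""
    if strip_think:
        pred = THINK_BLOCK_RE.sub("", pred)
    pred = pred.strip()
    exp = (expected or "").strip().upper()
    if not pred or exp not in ("A", "B", "C", "D"):
        return 0.0
    # Step 1: the whole response is a single letter, modulo punctuation.
    stripped = pred.upper().strip(".,)!:- ")
    if stripped in ("A", "B", "C", "D"):
        return 1.0 if stripped == exp else 0.0
    # Steps 2 and 3: letter after answer/choice/option, else first bounded letter.
    m = AFTER_ANSWER_RE.search(pred) or BOUNDED_RE.search(pred)
    return 1.0 if m and m.group(1).upper() == exp else 0.0


# Made-up response whose reasoning trace mentions a different option.
response = "<think>Option A seems right at first.</think>\nAnswer: B"

with_strip = mcq_letter_sketch(response, "B")                        # trace removed first
without_strip = mcq_letter_sketch(response, "B", strip_think=False)  # "Option A" wins
bare = mcq_letter_sketch("b.", "B")                                  # single-letter path
print(with_strip, without_strip, bare)
```

With the trace left in place, the "Option A" inside the `<think>` block is the first match for the answer/choice/option pattern, so the response is scored as A; stripping the block first lets "Answer: B" decide.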
@@ -108,6 +118,415 @@ def is_refusal(text: str | None) -> bool:
108
118
  return any(p.search(text) for p in REFUSAL_PATTERNS)
109
119
 
110
120
 
121
+ # --- MCQ letter scorer ---------------------------------------------------
122
+
123
+ _MCQ_AFTER_ANSWER_RE = re.compile(
124
+ r"\b(?:answer|choice|option)\b[^A-Za-z0-9]{0,20}([A-D])\b",
125
+ re.IGNORECASE,
126
+ )
127
+ _MCQ_BOUNDED_RE = re.compile(r"\b([A-D])\b", re.IGNORECASE)
128
+ _THINK_BLOCK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)
129
+
130
+
131
+ def mcq_letter(
132
+ predicted: str,
133
+ expected: str,
134
+ *,
135
+ strip_think: bool = True,
136
+ ) -> float:
137
+ """Score MCQ letter responses, with `<think>`-aware extraction for
138
+ reasoning models.
139
+
140
+ Promoted to `fieldkit.eval` after three vertical-bench reuses
141
+ (cybermetric, medmcqa, patent-strategist). Decision order:
142
+ (a) stripped one-letter output ("B"); (b) "answer: X" / "answer is X" /
143
+ "option X" / "choice X" with X in [A-D]; (c) first word-bounded [A-D]
144
+ in the response. Case-insensitive throughout.
145
+
146
+ When `strip_think=True` (default), `<think>...</think>` blocks are
147
+ regex-stripped from `predicted` *before* the three-step decision —
148
+ keeps reasoning-trace verbosity from polluting the letter pick on
149
+ R1-distill family models. No-op on text without `<think>` tags, so
150
+ cyber/medical callers can leave the default on safely.
151
+ """
152
+ pred = (predicted or "")
153
+ if strip_think:
154
+ pred = _THINK_BLOCK_RE.sub("", pred)
155
+ pred = pred.strip()
156
+ exp = (expected or "").strip().upper()
157
+ if not pred or exp not in ("A", "B", "C", "D"):
158
+ return 0.0
159
+ stripped = pred.upper().strip(".,)!:- ")
160
+ if len(stripped) <= 1 and stripped in ("A", "B", "C", "D"):
161
+ return 1.0 if stripped == exp else 0.0
162
+ m = _MCQ_AFTER_ANSWER_RE.search(pred)
163
+ if m:
164
+ return 1.0 if m.group(1).upper() == exp else 0.0
165
+ m = _MCQ_BOUNDED_RE.search(pred)
166
+ if m:
167
+ return 1.0 if m.group(1).upper() == exp else 0.0
168
+ return 0.0
169
+
170
+
171
+ # --- Patent-strategist specialty scorers ---------------------------------
172
+ # Added in v0.4.3 alongside the `format='patent-strategist'` branch of
173
+ # `VerticalBench`. Two are LLM-judge-backed (`patent_claim_validity`,
174
+ # `office_action_argument` — they wrap a caller-supplied `Judge`), one is
175
+ # deterministic regex-checklist (`irac_structure`), and one is a pure
176
+ # Spearman-rank reducer over ranked prior-art lists (`prior_art_relevance`).
177
+ # Per `specs/patent-strategist-v1.md` §3.3. Rubric markdown ships alongside
178
+ # this module under `rubrics/` and loads via `load_rubric()`.
179
+
180
+
181
+ def load_rubric(name: str) -> str:
182
+ """Return the text of a rubric markdown file shipped with `fieldkit.eval`.
183
+
184
+ `name` is the bare filename without extension — e.g. ``"patent_claim_validity"``
185
+ resolves to ``fieldkit/eval/rubrics/patent_claim_validity.md``. The file
186
+ is read once per call (small files, no caching needed for the wheel-bundled
187
+ surface).
188
+ """
189
+ rubrics_dir = Path(__file__).parent / "rubrics"
190
+ path = rubrics_dir / f"{name}.md"
191
+ return path.read_text(encoding="utf-8")
192
+
193
+
194
+ RUBRIC_PATENT_CLAIM_VALIDITY: str = load_rubric("patent_claim_validity")
195
+ """System prompt for the PatentScore-methodology 7-dim claim-validity rubric.
196
+
197
+ Pass to `Judge(client=..., rubric=RUBRIC_PATENT_CLAIM_VALIDITY)` and feed
198
+ into `patent_claim_validity(predicted, expected, judge=<that judge>)`."""
199
+
200
+ RUBRIC_OFFICE_ACTION_ARGUMENT: str = load_rubric("office_action_argument")
201
+ """System prompt for the 4-dim office-action-response rubric (rejection-type
202
+ ID, statutory citation accuracy, argument structure, persuasiveness)."""
203
+
204
+
205
+ def _format_rubric_hints(rubric: dict[str, Any] | None) -> str:
206
+ """Render a per-row rubric dict as a stable `Hints:` block.
207
+
208
+ Keys are sorted for determinism; list values get bullet-rendered; nested
209
+ objects fall through to `json.dumps`. Returns ``""`` on empty/None input.
210
+ """
211
+ if not rubric:
212
+ return ""
213
+ lines = ["Hints:"]
214
+ for k in sorted(rubric.keys()):
215
+ v = rubric[k]
216
+ if isinstance(v, (list, tuple)):
217
+ lines.append(f"- {k}:")
218
+ for item in v:
219
+ lines.append(f" - {item}")
220
+ elif isinstance(v, dict):
221
+ lines.append(f"- {k}: {json.dumps(v, sort_keys=True)}")
222
+ else:
223
+ lines.append(f"- {k}: {v}")
224
+ return "\n".join(lines)
225
+
226
+
227
+ def patent_claim_validity(
228
+ predicted: str,
229
+ expected: str,
230
+ *,
231
+ judge: Judge,
232
+ rubric: dict[str, Any] | None = None,
233
+ ) -> float:
234
+ """Score a predicted patent claim against a reference via the PatentScore
235
+ 7-dim rubric (`RUBRIC_PATENT_CLAIM_VALIDITY`).
236
+
237
+ Caller supplies a `Judge` instance constructed with
238
+ `rubric=RUBRIC_PATENT_CLAIM_VALIDITY`; this function feeds the prediction
239
+ + reference (+ optional per-row `rubric` dict rendered as `Hints:`) into
240
+ `judge.grade()` and returns the parsed score, mapping ``None`` →
241
+ ``0.0`` so the bench's accuracy-averaging stays well-defined.
242
+
243
+ Per-row `rubric` keys are convention rather than enforcement — typical
244
+ examples include ``cited_prior_art``, ``claim_type``
245
+ (``independent`` / ``dependent``), ``dependency_target``,
246
+ ``statutory_focus`` (e.g. ``["102", "103"]``).
247
+ """
248
+ hints = _format_rubric_hints(rubric)
249
+ context: str | None = hints or None
250
+ result = judge.grade(
251
+ prediction=predicted,
252
+ reference=expected or None,
253
+ context=context,
254
+ )
255
+ return result.score if result.score is not None else 0.0
256
+
257
+
258
+ def office_action_argument(
259
+ predicted: str,
260
+ expected: str,
261
+ *,
262
+ judge: Judge,
263
+ rubric: dict[str, Any] | None = None,
264
+ ) -> float:
265
+ """Score an attorney's predicted office-action response via the 4-dim
266
+ rubric (`RUBRIC_OFFICE_ACTION_ARGUMENT`).
267
+
268
+ Caller supplies a `Judge` constructed with that rubric; `predicted` is
269
+ the attorney response text, `expected` is the reference response (or
270
+ empty when the row's gold is a rubric dict only). Per-row `rubric`
271
+ convention keys: ``rejection_type`` (``102`` / ``103`` / ``112(a)`` /
272
+ ``112(b)`` / ``101`` / ``double-patenting`` / ``restriction``),
273
+ ``required_citations`` (list of expected MPEP/CFR/case cites),
274
+ ``claim_count``, ``relies_on_official_notice`` (bool).
275
+ """
276
+ hints = _format_rubric_hints(rubric)
277
+ context: str | None = hints or None
278
+ result = judge.grade(
279
+ prediction=predicted,
280
+ reference=expected or None,
281
+ context=context,
282
+ )
283
+ return result.score if result.score is not None else 0.0
284
+
285
+
286
+ # IRAC structure detector — one regex per component. Each pattern matches if
287
+ # the predicted response signals the component's presence via a section
288
+ # heading, transition phrase, or canonical lead-in. Patterns are deliberately
289
+ # tolerant (markdown / plain prose / numbered lists all hit) — false positives
290
+ # are far less harmful than false negatives at the 0.25-granularity we report.
291
+ _IRAC_PATTERNS: dict[str, re.Pattern[str]] = {
292
+ "issue": re.compile(
293
+ r"\b(?:issue\s*[:\s]|the\s+issue\b|question\s+presented\b|whether\b)",
294
+ re.IGNORECASE,
295
+ ),
296
+ "rule": re.compile(
297
+ r"(?:\brule\b|\bthe\s+rule\b|"
298
+ r"\bunder\s+(?:35\s+u\.?s\.?c\.?|the\s+(?:statute|law|holding|standard|mpep))|"
299
+ r"\bmpep\s+(?:§|sec(?:tion)?\.?)?\s*\d|"
300
+ r"\b35\s+u\.?s\.?c\.?\s*(?:§|sec(?:tion)?\.?)?\s*\d{2,3}|"
301
+ r"\blaw\s+(?:provides|states|requires)\b|"
302
+ r"\bholding\b)",
303
+ re.IGNORECASE,
304
+ ),
305
+ "application": re.compile(
306
+ r"(?:\bapplication\b|\bapplying\b|\bhere[,\s]|\bin\s+this\s+case\b|"
307
+ r"\bin\s+the\s+(?:present|instant)\b|"
308
+ r"\bthe\s+(?:present|instant)\s+(?:claim|application|invention)\b|"
309
+ r"\bas\s+applied\b|\bapplied\s+to\b)",
310
+ re.IGNORECASE,
311
+ ),
312
+ "conclusion": re.compile(
313
+ r"(?:\bconclusion\b|\bin\s+conclusion\b|\btherefore\b|\baccordingly\b|"
314
+ r"\bthus[,\s]|"
315
+ r"\bfor\s+(?:the\s+)?(?:above|foregoing)\s+reasons\b|"
316
+ r"\bin\s+sum\b|"
317
+ r"\bthe\s+(?:examiner|applicant)\s+(?:should|is\s+respectfully\s+requested)\b)",
318
+ re.IGNORECASE,
319
+ ),
320
+ }
321
+
322
+
323
+ def irac_structure(predicted: str, expected: str = "") -> float:
324
+ """Deterministic 4-checklist IRAC structure scorer.
325
+
326
+ Returns one of ``{0.0, 0.25, 0.5, 0.75, 1.0}`` — one quarter per IRAC
327
+ component (Issue / Rule / Application / Conclusion) detected via
328
+ regex on `predicted`. `expected` is ignored (the scorer measures
329
+ structural form, not factual agreement) but kept in the signature
330
+ for `VerticalBench` compatibility.
331
+
332
+ Detection patterns are deliberately tolerant: markdown headings,
333
+ plain-prose transitions, and canonical Patent-Bar lead-ins
334
+ ("Whether…", "Under 35 USC 103…", "Here…", "Therefore…") all count.
335
+ False positives are far less harmful than false negatives at this
336
+ granularity — the score's job is to flag *structural absence*, not
337
+ to grade rhetorical polish.
338
+ """
339
+ if not predicted:
340
+ return 0.0
341
+ hits = sum(1 for pat in _IRAC_PATTERNS.values() if pat.search(predicted))
342
+ return hits / 4.0
343
+
344
+
345
+ @dataclass(frozen=True, slots=True)
346
+ class PriorArtRelevanceResult:
347
+ """Full result from `prior_art_relevance_full` — both rank-correlation and
348
+ Likert-MSE figures. The bench-facing `prior_art_relevance` scorer
349
+ surfaces just `spearman_rho` per spec §3.3; callers who need both
350
+ metrics import this dataclass directly.
351
+ """
352
+
353
+ spearman_rho: float
354
+ mse_likert: float | None
355
+ n: int
356
+
357
+
358
+ _RANKED_LIST_RE = re.compile(r"^[\s\-\*\d.\)]*([^\s].*?)\s*$")
359
+
360
+
361
+ def _parse_ranked_list(s: str) -> list[str]:
362
+ """Tolerant parser: JSON array, comma-separated, or newline-separated.
363
+
364
+ Strips numeric / bullet / paren prefixes ("1.", "1)", "- ", "* ") so
365
+ a model can emit "1. doc-a\n2. doc-b" and the scorer recovers
366
+ ``["doc-a", "doc-b"]``. Empty/whitespace lines are dropped.
367
+ """
368
+ raw = (s or "").strip()
369
+ if not raw:
370
+ return []
371
+ if raw.startswith("["):
372
+ try:
373
+ arr = json.loads(raw)
374
+ if isinstance(arr, list):
375
+ return [str(x).strip() for x in arr if str(x).strip()]
376
+ except json.JSONDecodeError:
377
+ pass
378
+ sep = "\n" if "\n" in raw else ","
379
+ items: list[str] = []
380
+ for chunk in raw.split(sep):
381
+ chunk = chunk.strip()
382
+ if not chunk:
383
+ continue
384
+ m = _RANKED_LIST_RE.match(chunk)
385
+ if m:
386
+ items.append(m.group(1).strip())
387
+ else:
388
+ items.append(chunk)
389
+ return items
390
+
391
+
392
+ def _spearman_rho(ranks_a: list[float], ranks_b: list[float]) -> float:
393
+ """Spearman rank correlation. Both inputs must be equal-length rank
394
+ vectors (already converted from raw values to ranks). Returns 0.0 for
395
+ n<2 (insufficient signal to correlate)."""
396
+ n = len(ranks_a)
397
+ if n < 2 or len(ranks_b) != n:
398
+ return 0.0
399
+ mean_a = sum(ranks_a) / n
400
+ mean_b = sum(ranks_b) / n
401
+ cov = sum((ranks_a[i] - mean_a) * (ranks_b[i] - mean_b) for i in range(n))
402
+ var_a = sum((r - mean_a) ** 2 for r in ranks_a)
403
+ var_b = sum((r - mean_b) ** 2 for r in ranks_b)
404
+ denom = math.sqrt(var_a * var_b)
405
+ if denom == 0.0:
406
+ return 0.0
407
+ return cov / denom
+
+
+ def prior_art_relevance_full(
+     predicted: str | list[str],
+     expected: str | list[str],
+ ) -> PriorArtRelevanceResult:
+     """Compute Spearman ρ (and Likert MSE when both sides are numeric) between
+     a predicted ranked list of prior-art IDs and a gold ranked list.
+
+     Accepts list-of-str directly or a string (JSON / comma / newline). Items
+     in `predicted` that don't appear in `expected` are dropped; items in
+     `expected` that don't appear in `predicted` are assigned the worst
+     available rank (so omissions still penalize). Comparison is on the
+     overlap, padded for missing items.
+
+     `mse_likert` is computed only when both sides parse as numeric Likert
+     scores (1-5 relevance ratings rather than ID lists); otherwise it's
+     ``None``. The bench-facing `prior_art_relevance` surface uses
+     `spearman_rho` per spec §3.3.
+     """
+     pred_items = predicted if isinstance(predicted, list) else _parse_ranked_list(predicted)
+     gold_items = expected if isinstance(expected, list) else _parse_ranked_list(expected)
+     pred_items = [str(x) for x in pred_items]
+     gold_items = [str(x) for x in gold_items]
+
+     # Likert MSE branch — both sides parse as numbers and same length.
+     pred_nums = _try_parse_numeric_list(pred_items)
+     gold_nums = _try_parse_numeric_list(gold_items)
+     mse_likert: float | None
+     if pred_nums is not None and gold_nums is not None and len(pred_nums) == len(gold_nums) and pred_nums:
+         n = len(pred_nums)
+         mse_likert = sum((pred_nums[i] - gold_nums[i]) ** 2 for i in range(n)) / n
+         # When both sides are Likert vectors, Spearman runs on the numbers themselves.
+         rho = _spearman_rho(_rankify(pred_nums), _rankify(gold_nums))
+         return PriorArtRelevanceResult(spearman_rho=rho, mse_likert=mse_likert, n=n)
+
+     mse_likert = None
+
+     # ID-list branch — rank by position; missing-from-pred → worst available.
+     if not gold_items:
+         return PriorArtRelevanceResult(spearman_rho=0.0, mse_likert=None, n=0)
+
+     # Gold rank: 1 = best, N = worst.
+     gold_rank: dict[str, int] = {item: i + 1 for i, item in enumerate(gold_items)}
+     n_gold = len(gold_items)
+
+     # Predicted rank for each gold item: position in `pred_items`, or n_gold+1
+     # (worst-plus-one) if absent.
+     pred_rank: dict[str, int] = {}
+     seen: set[str] = set()
+     for i, item in enumerate(pred_items):
+         if item not in seen:
+             pred_rank[item] = i + 1
+             seen.add(item)
+
+     paired_gold: list[float] = []
+     paired_pred: list[float] = []
+     worst = n_gold + 1
+     for item in gold_items:
+         paired_gold.append(float(gold_rank[item]))
+         paired_pred.append(float(pred_rank.get(item, worst)))
+
+     # Re-rank both vectors before Spearman so positional gaps (from
+     # duplicates skipped in pred, or worst-rank padding) collapse to
+     # contiguous 1..N ranks. Without this, ["a","a","b","c"] vs ["a","b","c"]
+     # would produce pred-rank-vector [1,3,4] vs gold [1,2,3] and yield ρ≈0.98
+     # instead of the intuitive 1.0 (the pred order *is* monotonic in gold).
+     rho = _spearman_rho(_rankify(paired_pred), _rankify(paired_gold))
+     return PriorArtRelevanceResult(spearman_rho=rho, mse_likert=mse_likert, n=n_gold)
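
The pairing step in the ID-list branch above can be sketched on its own. This is a simplified illustration under stated assumptions, not the package's implementation: `pair_ranks` is a hypothetical name, and the gold list is assumed to contain no duplicates. It shows how gold items missing from the prediction receive the worst-plus-one rank, and how repeated predictions keep only their first position.

```python
def pair_ranks(pred: list[str], gold: list[str]) -> tuple[list[int], list[int]]:
    """Return (gold_ranks, pred_ranks) aligned on the gold items (hypothetical)."""
    first_pos: dict[str, int] = {}
    for i, item in enumerate(pred):
        first_pos.setdefault(item, i + 1)  # keep the first occurrence only
    worst = len(gold) + 1  # rank given to gold items absent from pred
    gold_ranks = list(range(1, len(gold) + 1))  # assumes gold has no duplicates
    pred_ranks = [first_pos.get(item, worst) for item in gold]
    return gold_ranks, pred_ranks


# "c" is missing from the prediction, so it is padded with rank 4.
missing_case = pair_ranks(["b", "a"], ["a", "b", "c"])
# The duplicated "a" leaves a positional gap: [1, 3, 4] rather than [1, 2, 3].
duplicate_case = pair_ranks(["a", "a", "b", "c"], ["a", "b", "c"])
```

These raw position vectors still contain gaps, which is why the diffed code re-ranks both vectors before computing ρ.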
+
+
+ def prior_art_relevance(
+     predicted: str | list[str],
+     expected: str | list[str],
+ ) -> float:
+     """Spearman ρ between predicted and gold prior-art rankings.
+
+     Bench-facing wrapper around `prior_art_relevance_full` — returns just
+     `spearman_rho` so it slots into `VerticalBench`'s
+     ``Callable[..., float]`` scorer contract. ρ ranges over ``[-1.0, 1.0]``;
+     bench averages it across questions per spec §3.3.
+
+     See `prior_art_relevance_full` for input parsing rules (JSON array,
+     comma-separated, newline-separated all accepted; list[str] accepted
+     directly) and the Likert-MSE second metric.
+     """
+     return prior_art_relevance_full(predicted, expected).spearman_rho
+
+
+ def _try_parse_numeric_list(items: list[str]) -> list[float] | None:
+     """Return a list of floats if every item parses as a number, else None."""
+     if not items:
+         return None
+     out: list[float] = []
+     for x in items:
+         try:
+             out.append(float(x))
+         except (TypeError, ValueError):
+             return None
+     return out
+
+
+ def _rankify(values: list[float]) -> list[float]:
+     """Convert a value vector to its average-rank vector (1-indexed).
+
+     Ties get the mean of the ranks they span — standard Spearman tie-handling.
+     """
+     n = len(values)
+     indexed = sorted(range(n), key=lambda i: values[i])
+     ranks = [0.0] * n
+     i = 0
+     while i < n:
+         j = i
+         while j + 1 < n and values[indexed[j + 1]] == values[indexed[i]]:
+             j += 1
+         avg_rank = (i + j) / 2.0 + 1.0
+         for k in range(i, j + 1):
+             ranks[indexed[k]] = avg_rank
+         i = j + 1
+     return ranks
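
Average-rank tie handling, as described in the docstring above, can be cross-checked with a direct counting definition. The sketch below is an independent, quadratic-time illustration rather than the package's sort-based code; `average_ranks` is a hypothetical name. A value with `less` strictly smaller entries and `equal` equal entries (itself included) spans ranks `less + 1` through `less + equal`, whose mean is `less + (equal + 1) / 2`.

```python
def average_ranks(values: list[float]) -> list[float]:
    """Average (fractional) ranks, 1-indexed, via direct counting (hypothetical)."""
    ranks: list[float] = []
    for v in values:
        less = sum(1 for w in values if w < v)
        equal = sum(1 for w in values if w == v)  # includes v itself
        # Tied values occupy ranks less+1 .. less+equal; take their mean.
        ranks.append(less + (equal + 1) / 2.0)
    return ranks


# The two 20.0 entries share ranks 2 and 3, so each receives 2.5.
tied = average_ranks([10.0, 20.0, 20.0, 30.0])
# A gapped position vector collapses to contiguous ranks.
gapped = average_ranks([1.0, 3.0, 4.0])
```

The second case mirrors the re-ranking rationale in the diff: a gapped position vector such as [1, 3, 4] becomes [1, 2, 3] before correlation.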
+
+
  # --- Bench ---------------------------------------------------------------
 
 
@@ -0,0 +1,51 @@
+ You are an impartial patent-prosecution grader scoring an attorney's
+ *predicted* office-action response against a *reference* response (or a
+ rubric specifying the expected rejection type + key citations) on 4
+ dimensions, each 0-1. Return the arithmetic mean as `score`.
+
+ **Dimensions**
+
+ 1. **Rejection-type identification** — does the response correctly identify
+ the rejection's statutory basis? Valid types include §101 (subject matter
+ eligibility), §102 (anticipation), §103 (obviousness), §112(a) (written
+ description / enablement), §112(b) (indefiniteness), double-patenting
+ (statutory or obviousness-type), and restriction requirements. 1.0 =
+ correct type AND statutory subsection cited · 0.5 = type correct,
+ subsection wrong or missing · 0.0 = wrong type.
+
+ 2. **Statutory citation accuracy** — are all cited statutes, CFR rules, and
+ MPEP sections accurate (correct section + subsection numbering, no
+ fabricated citations)? Penalize hallucinated MPEP sections (very common
+ failure mode for under-trained models). 1.0 = all cites accurate ·
+ 0.5 = mostly correct with one minor error · 0.0 = fabricated or
+ substantially wrong cites.
+
+ 3. **Argument structure** — does the response follow the canonical
+ prosecution response shape: (a) restate the rejection, (b) traverse with
+ reasoning grounded in case law / MPEP, (c) propose amendment if needed
+ with support citation, (d) summary statement / request for allowance?
+ Penalize bare denials, amendments without §112(a) support pointers, and
+ missing claim-by-claim treatment when the rejection lists multiple claims.
+
+ 4. **Persuasiveness** — would an examiner read this and find it credible
+ enough to merit withdrawing the rejection (or, at minimum, issuing a Final
+ with substantive new analysis rather than restating)? Score the *technical
+ substance* of the argument, not its tone. Penalize attorney-argument-only
+ responses (without evidence/declarations) where the rejection rests on
+ official notice that requires rebuttal evidence.
+
+ **Inputs**
+
+ The user message provides the predicted response text, the reference text
+ (if any), and optionally per-row rubric hints (e.g. expected rejection_type,
+ required_citations list, claim_count) under a `Hints:` heading. Use the
+ hints to anchor scoring; missing reference = score on facial quality and note
+ the assumption.
+
+ **Output**
+
+ Return ONLY a JSON object:
+
+ ```json
+ {"score": 0.65, "rationale": "Identifies §103 correctly and cites MPEP 2143 (KSR rationales). Weakest on argument-structure (no §112(a) support pointer for the proposed amendment) and persuasiveness (relies on attorney argument alone where rejection cites official notice — declaration would strengthen)."}
+ ```
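
The output contract in the rubric above (a single JSON object with `score` and `rationale`, where `score` is the arithmetic mean of four 0-1 dimension scores) could be validated on the consumer side roughly as follows. This is a minimal sketch under assumptions: `parse_judge_reply` and `mean_of_dimensions` are hypothetical helpers, not fieldkit APIs, and the per-dimension values are made up for illustration.

```python
import json


def parse_judge_reply(reply: str) -> tuple[float, str]:
    """Extract (score, rationale) from a judge reply (hypothetical helper)."""
    obj = json.loads(reply)
    if not isinstance(obj, dict) or "score" not in obj:
        raise ValueError("judge reply must be a JSON object with a 'score' key")
    score = float(obj["score"])
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score {score} is outside [0, 1]")
    return score, str(obj.get("rationale", ""))


def mean_of_dimensions(dimension_scores: list[float]) -> float:
    """Arithmetic mean of per-dimension scores, as the rubric asks for."""
    return sum(dimension_scores) / len(dimension_scores)


score, rationale = parse_judge_reply(
    '{"score": 0.65, "rationale": "Identifies §103 correctly."}'
)
# Illustrative per-dimension scores whose mean matches the reply above.
expected_mean = mean_of_dimensions([1.0, 0.5, 0.5, 0.6])
```

Rejecting out-of-range scores up front keeps a malformed judge reply from silently skewing an averaged benchmark number.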