PyPI - fieldkit - Versions diffs - 0.4.2__tar.gz → 0.4.3__tar.gz - Mend

fieldkit 0.4.2__tar.gz → 0.4.3__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (55)

{fieldkit-0.4.2 → fieldkit-0.4.3}/.gitignore RENAMED

@@ -25,7 +25,8 @@ pnpm-debug.log*
 # local-only working material (not for the public blog)
 ideas/
 HANDOFF.md
-.claude/
+.claude/*
+!.claude/skills/
 # transient vibe-test artifacts (Playwright screenshots written to repo root)
 .playwright-mcp/

{fieldkit-0.4.2 → fieldkit-0.4.3}/CHANGELOG.md RENAMED

@@ -6,6 +6,40 @@ The format follows [Keep a Changelog](https://keepachangelog.com/en/1.1.0/), and
 ## [Unreleased]
+## [0.4.3] — 2026-05-17
+### Added — `fieldkit.eval` patent-strategist scorer build-out (T6)
+Four new scorers in `fieldkit.eval` round out the `format='patent-strategist'` branch landed in v0.4.2 (T4) and the `mcq_letter` promotion (T5), per `specs/patent-strategist-v1.md` §3.3:
+- **`patent_claim_validity(predicted, expected, *, judge, rubric=None)`** — PatentScore-methodology 7-dim claim-validity scorer (novelty / non-obviousness / written-description / enablement / indefiniteness / subject-matter-eligibility / dependent-claim-structure). LLM-judge backed; caller supplies a `Judge(client=..., rubric=RUBRIC_PATENT_CLAIM_VALIDITY)`. Per-row `rubric` dict (e.g. `cited_prior_art`, `claim_type`) is rendered into a sorted, deterministic `Hints:` block fed to the judge as context. PatentScore methodology only — no data reuse from the cited paper (license unclear).
+- **`office_action_argument(predicted, expected, *, judge, rubric=None)`** — 4-dim office-action-response scorer (rejection-type identification, statutory citation accuracy, argument structure, persuasiveness). Same `Judge`-wrapping shape; per-row hints like `rejection_type`, `required_citations`, `claim_count`, `relies_on_official_notice` flow through the `Hints:` block.
+- **`irac_structure(predicted, expected="")`**: deterministic 4-checklist scorer for Patent-Bar-style IRAC responses. One regex per component (Issue / Rule / Application / Conclusion); returns one of `{0.0, 0.25, 0.5, 0.75, 1.0}`, a quarter point per component detected. The patterns are tolerant: markdown headings, all-caps section labels, and transition prose ("Whether…", "Under 35 USC 103…", "Here…", "Therefore…") all count. False positives are far less harmful than false negatives at quarter-granularity. It needs neither a judge nor a network, so it's the one wired end-to-end through `VerticalBench` in the integration test.
+- **`prior_art_relevance(predicted, expected) -> float`** — Spearman ρ on ranked prior-art lists, returning just the rho per spec §3.3. Tolerant parser accepts JSON arrays (`'["a","b","c"]'`), comma-separated, or newline-separated (with `1.`, `1)`, `- `, `* ` prefixes stripped) as well as `list[str]` directly. Missing-from-pred gold items get worst-rank padding so omissions still penalize. The paired-rank vectors are re-rankified before correlation so positional gaps from dup-skipping or padding collapse to contiguous ranks — without this, `["a","a","b","c"]` vs `["a","b","c"]` would yield ρ≈0.98 instead of the intuitive 1.0. **`prior_art_relevance_full`** returns the same rho plus an `mse_likert` field (populated only when both sides parse as numeric Likert vectors) and `n`, packaged as the frozen `PriorArtRelevanceResult` dataclass.
+### Added — rubric markdown bundled in the wheel
+- **`fieldkit/src/fieldkit/eval/rubrics/{patent_claim_validity,office_action_argument}.md`**: system-prompt markdown shipped alongside the module. Read from disk by the new **`load_rubric(name)`** helper; the **`RUBRIC_PATENT_CLAIM_VALIDITY`** / **`RUBRIC_OFFICE_ACTION_ARGUMENT`** module constants call it once at import time and cover the common case. `[tool.hatch.build.targets.wheel].include` extended with `src/fieldkit/eval/rubrics/*.md` so the markdown lands in the wheel.
+### Added — `fieldkit.eval.vertical` live-callable dispatch
+- **`PATENT_STRATEGIST_SCORER_FNS: dict[str, Callable[..., float]]`** — companion to the existing string-keyed `PATENT_STRATEGIST_SCORERS` map. Resolves the four T6 scorers + the promoted `mcq_letter` to live functions (skips the two `judge_rubric` slots ("C", "E") which are open-ended `Judge.grade(...)` calls without a single named scorer fn). Drift-detection test asserts every fn's `__name__` matches the matching string-map entry.
+### Test suite
+**+93 new tests** across three new test files + the existing vertical-bench test class:
+- `tests/eval/test_irac_structure.py` — perfect / partial / per-component-detector coverage; quarter-granularity parametrize; whitespace-only / empty / expected-arg-ignored edges.
+- `tests/eval/test_prior_art_relevance.py` — perfect / reversed / partial-overlap; string-parsing variants (JSON, comma, newline-numbered, bullet, paren-numbered); Likert MSE branch (perfect, off-by-one, length-mismatch fallback, non-numeric); dataclass shape (frozen, three fields); the known-value `n=4` swap (ρ=0.8) plus the dup-skip test that drove the `_rankify`-on-paired-vectors fix.
+- `tests/eval/test_judge_backed_scorers.py` — `load_rubric` round-trip + missing-file error; `_format_rubric_hints` (empty / scalar / list-bullet / sorted-determinism / nested-dict JSON); both judge-backed scorers wired against a `_FakeJudge` fixture (no network) covering happy path, `None`-score fallback to `0.0`, rubric→`Hints:` threading, empty-reference collapse to `None`; signature-introspection tests ensuring `judge` and `rubric` stay keyword-only so `VerticalBench.scorer_kwargs` plumbing works.
+- `tests/test_vertical_bench.py::TestPatentStrategistFormat` — 3 new tests: `PATENT_STRATEGIST_SCORER_FNS` resolves each key to the expected callable; name-map vs fn-map drift assertion; full end-to-end `VerticalBench.run` exercising `irac_structure` over a 2-row JSONL with one perfect and one half-formed IRAC response (mean accuracy = 0.75).
+Total suite: **507 passed, 2 skipped** offline (`pytest -q`, `/tmp/fk` venv). The 2 skips are the long-standing `--spark`-gated live-NIM / pgvector integration tests.
+### Articles in this release
+- `articles/becoming-a-patent-strategist-on-spark/` — patent-strategist v1.0 article (W3 publish target per spec §1 deliverables). T6's scorer build-out is the load-bearing dependency for the article's bench-comparison numbers; v0.4.3 is the version the article will pin against.
 ## [0.4.2] — 2026-05-15
 Patch release. Two card-rendering polish lifts on `fieldkit.publish` driven by the 2026-05-15 cyber-vertical cycle (`Orionfold/SecurityLLM-GGUF`, the third vertical card on this surface — zero fieldkit source changes between Saul / cyber, the v0.4.1 publishing surface generalized exactly as designed). Both lifts are additive (one new `ModelCard` field already shipped on `main` in `ff1b92f`; one new `ArtifactManifest` field added here). No new modules, no new public classes, no breaking changes — purely a tightening pass.
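The drift-detection test described in the 0.4.3 entry (every function's `__name__` must match the string-map entry under the same key) can be sketched as follows. The slot keys, the two stub scorers, and the `find_drift` helper are hypothetical stand-ins for illustration; the real maps live in `fieldkit.eval.vertical`.

```python
# Stub scorers standing in for the real ones (assumption: bodies are irrelevant here).
def mcq_letter(predicted, expected):
    return 0.0


def irac_structure(predicted, expected=""):
    return 0.0


# Hypothetical slot keys. The string map names every slot; the function map
# skips slots that have no single named scorer function.
SCORER_NAMES = {"slot_1": "mcq_letter", "slot_2": "judge_rubric", "slot_3": "irac_structure"}
SCORER_FNS = {"slot_1": mcq_letter, "slot_3": irac_structure}


def find_drift(names, fns):
    """Return the keys whose callable name disagrees with the string map."""
    return [key for key, fn in fns.items() if names.get(key) != fn.__name__]


drift = find_drift(SCORER_NAMES, SCORER_FNS)
print(drift)
```

An empty list means the two maps agree; renaming a function without updating the string map would surface its key here.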

{fieldkit-0.4.2 → fieldkit-0.4.3}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: fieldkit
-Version: 0.4.2
+Version: 0.4.3
 Summary: Verified-on-Spark patterns lifted from the ai-field-notes blog into one importable Python package.
 Project-URL: Homepage, https://ainative.business/fieldkit/
 Project-URL: Source, https://github.com/manavsehgal/ai-field-notes/tree/main/fieldkit

{fieldkit-0.4.2 → fieldkit-0.4.3}/docs/api/eval.md RENAMED

@@ -54,6 +54,14 @@ from fieldkit.eval import (
     # v0.4.x — vertical-curator surface
     VerticalBench, VerticalQA,
     contains, exact_match, numeric_match,
+    # v0.4.3 — patent-strategist scorers
+    mcq_letter,
+    irac_structure,
+    prior_art_relevance, prior_art_relevance_full, PriorArtRelevanceResult,
+    patent_claim_validity, office_action_argument,
+    RUBRIC_PATENT_CLAIM_VALIDITY, RUBRIC_OFFICE_ACTION_ARGUMENT,
+    load_rubric,
 )
 ```
@@ -292,6 +300,52 @@ numeric_match("Revenue was $4.55B", "4.5B",
 | `contains(p, e)` | The model is asked to answer in prose and the reference is a key fact/number/phrase that must appear somewhere in the answer. |
 | `numeric_match(p, e, *, rel_tolerance=0.01)` | FinanceBench-style quantitative answers. Extracts the first number from each side (commas stripped), compares under relative tolerance. Defaults to ±1% per FinanceBench's grading convention. Returns 0.0 if either side has no parseable number — including refusals, so the refusal counter elsewhere doesn't need to gate this scorer. |
+### Patent-strategist scorers *(v0.4.3)*
+Five scorers + two rubric constants land in v0.4.3 to round out the `format='patent-strategist'` branch of `VerticalBench`. Wire them through `VerticalBench(scorer=…, scorer_kwargs=…)` or import the live-callable dispatch map at `fieldkit.eval.vertical.PATENT_STRATEGIST_SCORER_FNS`. The 1-paragraph-per-scorer cheat sheet:
+#### `mcq_letter(predicted, expected, *, strip_think=True) -> float`
+MCQ letter scorer promoted from `scripts/g3_*.py` after three vertical-bench reuses (cybermetric, medmcqa, patent-strategist). Decision order: stripped one-letter (`"B"`), then `"answer: X"` / `"answer is X"` / `"option X"` / `"choice X"`, then first word-bounded `[A-D]`. Case-insensitive throughout. When `strip_think=True` (default), `<think>...</think>` blocks are regex-stripped *before* the three-step decision, which keeps reasoning-trace verbosity on R1-distill family models from polluting the letter pick. The substitution is a no-op on cyber/medical text without `<think>` tags, so existing callers can leave the default on safely.
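A minimal, self-contained sketch of why the `<think>` strip matters, with patterns modeled on the ones in this release's `eval/__init__.py` hunk. `pick_letter` is a hypothetical helper that returns the extracted letter rather than a score; it is not part of fieldkit.

```python
import re
from typing import Optional

# Patterns modeled on the release's regexes; `pick_letter` itself is hypothetical.
THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)
AFTER_CUE_RE = re.compile(
    r"\b(?:answer|choice|option)\b[^A-Za-z0-9]{0,20}([A-D])\b", re.IGNORECASE
)
BOUNDED_RE = re.compile(r"\b([A-D])\b", re.IGNORECASE)


def pick_letter(text: str, strip_think: bool = True) -> Optional[str]:
    """Return the extracted option letter, or None if nothing matches."""
    if strip_think:
        text = THINK_RE.sub("", text)
    text = text.strip()
    bare = text.upper().strip(".,)!:- ")
    if bare in ("A", "B", "C", "D"):  # step 1: bare one-letter reply
        return bare
    for pattern in (AFTER_CUE_RE, BOUNDED_RE):  # step 2, then step 3
        m = pattern.search(text)
        if m:
            return m.group(1).upper()
    return None


raw = "<think>Option A looks tempting, but C fits better.</think>\nAnswer: C"
with_strip = pick_letter(raw)                        # trace removed before matching
without_strip = pick_letter(raw, strip_think=False)  # "Option A" in the trace matches first
print(with_strip, without_strip)
```

With the trace stripped, the final `Answer: C` is the only cue left; without the strip, the earlier `Option A` inside the reasoning trace is picked up instead.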
+#### `irac_structure(predicted, expected="") -> float`
+Deterministic 4-checklist Patent-Bar IRAC detector. Returns one of `{0.0, 0.25, 0.5, 0.75, 1.0}` based on Issue / Rule / Application / Conclusion regex hits. Tolerant patterns: markdown headings, all-caps section labels, transition prose (`"Whether…"`, `"Under 35 USC 103…"`, `"Here…"`, `"Therefore…"`) all count. `expected` is ignored — the scorer measures structural form, not factual agreement; kept in the signature for `VerticalBench` compatibility. False positives are far less harmful than false negatives at this granularity; the score's job is to flag *structural absence*, not grade rhetorical polish.
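The quarter-point scoring can be sketched with a deliberately simplified pattern set. The four regexes below are assumptions that cover only the lead-ins quoted above; the release's patterns accept many more forms. `irac_score` is a hypothetical name.

```python
import re

# Simplified stand-ins for the release's tolerant patterns (assumption).
IRAC_SKETCH = {
    "issue": re.compile(r"\b(?:issue|whether)\b", re.IGNORECASE),
    "rule": re.compile(r"\b(?:rule\b|under\s+35\s+u\.?s\.?c\.?)", re.IGNORECASE),
    "application": re.compile(r"\b(?:here[,\s]|in\s+this\s+case\b)", re.IGNORECASE),
    "conclusion": re.compile(r"\b(?:therefore|conclusion)\b", re.IGNORECASE),
}


def irac_score(text: str) -> float:
    """One quarter point per IRAC component whose pattern fires."""
    if not text:
        return 0.0
    hits = sum(1 for pattern in IRAC_SKETCH.values() if pattern.search(text))
    return hits / len(IRAC_SKETCH)


full = (
    "Whether claim 1 is obvious over Smith. Under 35 USC 103, a claim is "
    "unpatentable when the differences would have been obvious. Here, Smith "
    "teaches every element. Therefore, the rejection should be maintained."
)
partial = "Whether claim 1 is obvious over Smith. Therefore, the rejection should be maintained."

scores = [irac_score(full), irac_score(partial)]
mean_score = sum(scores) / len(scores)
print(scores, mean_score)
```

The partial response carries only an Issue and a Conclusion, so it scores two quarters; averaged with the complete response, that gives the kind of per-bench mean `VerticalBench` reports.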
+#### `prior_art_relevance(predicted, expected) -> float`
+Spearman ρ between predicted and gold prior-art rankings — the bench-facing scalar per `specs/patent-strategist-v1.md` §3.3. Accepts `list[str]` directly or a tolerant string parse (JSON arrays `'["a","b","c"]'`, comma-separated `"a, b, c"`, or newline-separated with `1.` / `1)` / `- ` / `* ` prefixes stripped). Items missing from `predicted` get worst-rank padding so omissions still penalize. The paired-rank vectors get re-rankified before correlation so positional gaps from dup-skipping or padding collapse to contiguous ranks — without this, `["a","a","b","c"]` vs `["a","b","c"]` would yield ρ≈0.98 instead of 1.0.
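The effect of re-ranking before correlation can be checked with a small offline sketch. `rankify` and `pearson` below are illustrative re-implementations (average ranks for ties, then Pearson correlation on the ranks), not imports from fieldkit.

```python
import math


def rankify(values):
    """Average ranks, 1-indexed; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks


def pearson(a, b):
    """Pearson correlation; on rank vectors this is Spearman's rho."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    denom = math.sqrt(
        sum((x - mean_a) ** 2 for x in a) * sum((y - mean_b) ** 2 for y in b)
    )
    return cov / denom if denom else 0.0


gold = [1.0, 2.0, 3.0]
pred_with_gap = [1.0, 3.0, 4.0]  # positions of a, b, c in ["a", "a", "b", "c"]

rho_raw = pearson(pred_with_gap, gold)                          # gap left in place
rho_reranked = pearson(rankify(pred_with_gap), rankify(gold))   # gap collapsed
rho_swap = pearson(rankify([2.0, 1.0, 3.0, 4.0]), rankify([1.0, 2.0, 3.0, 4.0]))
print(round(rho_raw, 3), rho_reranked, rho_swap)
```

The last line is the adjacent-swap case on four items, which is the known-value check the test suite mentions.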
+#### `prior_art_relevance_full(predicted, expected) -> PriorArtRelevanceResult`
+Returns the same ρ plus an `mse_likert` field (populated only when both sides parse as numeric Likert vectors, e.g. `"5,4,3,2,1"`) and an `n` count, packaged as a frozen `PriorArtRelevanceResult(spearman_rho, mse_likert, n)` dataclass. The bench surface uses `prior_art_relevance` because the scorer contract is `Callable[..., float]`; this full variant is for callers that want both metrics in a single pass.
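A minimal sketch of the Likert branch, assuming both sides have already been split into items. `likert_mse` is a hypothetical helper that mirrors the described behavior: a mean squared error when both sides are numeric and equal in length, `None` otherwise.

```python
from typing import Optional, Sequence


def likert_mse(pred: Sequence[str], gold: Sequence[str]) -> Optional[float]:
    """Mean squared error between two Likert vectors, or None if not applicable."""
    try:
        p = [float(x) for x in pred]
        g = [float(x) for x in gold]
    except ValueError:
        return None  # at least one item is not numeric, e.g. a document ID
    if not p or len(p) != len(g):
        return None  # empty or mismatched lengths: no MSE
    return sum((a - b) ** 2 for a, b in zip(p, g)) / len(p)


mse = likert_mse("5,4,3,2,1".split(","), "4,4,3,2,2".split(","))
no_mse_ids = likert_mse(["US1", "US2"], ["US2", "US1"])
no_mse_len = likert_mse(["5", "4"], ["5", "4", "3"])
print(mse, no_mse_ids, no_mse_len)
```

Two of the five ratings are off by one, so the squared errors sum to 2 over 5 items.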
+#### `patent_claim_validity(predicted, expected, *, judge, rubric=None) -> float`
+PatentScore-methodology 7-dim claim-validity scorer (novelty / non-obviousness / written-description / enablement / indefiniteness / subject-matter-eligibility / dependent-claim-structure). LLM-judge backed; caller supplies a `Judge` instance constructed with `rubric=RUBRIC_PATENT_CLAIM_VALIDITY`. Per-row `rubric` dict (convention keys: `cited_prior_art`, `claim_type`, `dependency_target`, `statutory_focus`) renders into a deterministic sorted `Hints:` block fed to the judge as context. Returns the parsed score, mapping `None` → `0.0` so bench accuracy-averaging stays well-defined. **PatentScore methodology only — no data reuse from the cited paper** (license unclear).
+```python
+from fieldkit.eval import Judge, RUBRIC_PATENT_CLAIM_VALIDITY, patent_claim_validity
+from fieldkit.nim import NIMClient
+with NIMClient(base_url="http://localhost:8000/v1", model="...") as c:
+    judge = Judge(client=c, rubric=RUBRIC_PATENT_CLAIM_VALIDITY)
+    score = patent_claim_validity(
+        predicted_claim_text,
+        reference_claim_text,
+        judge=judge,
+        rubric={"cited_prior_art": ["US10987654", "US20210123456"]},
+    )
+```
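The shape of the `Hints:` block can be previewed offline. `format_hints` below is an illustrative re-implementation of the rendering rules in this release (sorted keys, bulleted lists, JSON for nested dicts); the real helper is private to `fieldkit.eval`.

```python
import json


def format_hints(rubric):
    """Render a per-row rubric dict as a stable, sorted Hints block."""
    if not rubric:
        return ""
    lines = ["Hints:"]
    for key in sorted(rubric):
        value = rubric[key]
        if isinstance(value, (list, tuple)):
            lines.append(f"- {key}:")
            lines.extend(f"  - {item}" for item in value)
        elif isinstance(value, dict):
            lines.append(f"- {key}: {json.dumps(value, sort_keys=True)}")
        else:
            lines.append(f"- {key}: {value}")
    return "\n".join(lines)


block = format_hints(
    {"claim_type": "independent", "cited_prior_art": ["US10987654", "US20210123456"]}
)
print(block)
# Hints:
# - cited_prior_art:
#   - US10987654
#   - US20210123456
# - claim_type: independent
```

Because keys are sorted, the same rubric dict yields the same judge context regardless of insertion order.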
+#### `office_action_argument(predicted, expected, *, judge, rubric=None) -> float`
+4-dim office-action-response scorer (rejection-type identification, statutory citation accuracy, argument structure, persuasiveness). Same `Judge`-wrapping shape as `patent_claim_validity`; pair with `RUBRIC_OFFICE_ACTION_ARGUMENT`. Convention rubric keys: `rejection_type` (`102` / `103` / `112(a)` / `112(b)` / `101` / `double-patenting` / `restriction`), `required_citations` (list of expected MPEP/CFR/case cites), `claim_count`, `relies_on_official_notice`.
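The judge-wrapping shape can be exercised without a network by substituting a stub. `StubJudge`, `StubResult`, and `judged_score` are hypothetical names; the only assumption about the real `Judge` is that `grade()` accepts `prediction`, `reference`, and `context` keywords and returns an object with a `score` attribute, as the scorers in this release use it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StubResult:
    score: Optional[float]


class StubJudge:
    """Offline stand-in: returns a canned score and records each call."""

    def __init__(self, score: Optional[float]) -> None:
        self._score = score
        self.calls: list = []

    def grade(self, *, prediction, reference=None, context=None):
        self.calls.append(
            {"prediction": prediction, "reference": reference, "context": context}
        )
        return StubResult(self._score)


def judged_score(predicted, expected, *, judge, hints=""):
    """Forward to the judge; map an unparsed (None) score to 0.0."""
    result = judge.grade(
        prediction=predicted,
        reference=expected or None,  # empty reference collapses to None
        context=hints or None,
    )
    return result.score if result.score is not None else 0.0


graded = judged_score(
    "Applicant traverses the rejection.",
    "reference response",
    judge=StubJudge(0.65),
    hints="Hints:\n- rejection_type: 103",
)
silent = StubJudge(None)
unparsed = judged_score("response", "", judge=silent)
print(graded, unparsed, silent.calls[0]["reference"])
```

Keeping `judge` keyword-only is what lets a bench pass it through a kwargs dict without disturbing the positional `(predicted, expected)` contract.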
+#### Rubric loader: `load_rubric(name) -> str`
+The `RUBRIC_PATENT_CLAIM_VALIDITY` and `RUBRIC_OFFICE_ACTION_ARGUMENT` module constants are populated at import time from markdown files shipped under `fieldkit/eval/rubrics/`. Call `load_rubric("patent_claim_validity")` to re-read the file, or `load_rubric("my_rubric")` to load your own `my_rubric.md` from the same directory if you ship a fork. The `[tool.hatch.build.targets.wheel].include` glob ships `*.md` under that subtree, so the rubrics travel with the wheel.
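The name-to-file resolution can be sketched against a temporary directory. `load_rubric_from` is a hypothetical variant that takes the directory explicitly so the example runs without fieldkit installed; the release's `load_rubric` resolves the directory relative to the module.

```python
import tempfile
from pathlib import Path


def load_rubric_from(rubrics_dir: Path, name: str) -> str:
    """Resolve a bare rubric name to `<rubrics_dir>/<name>.md` and read it."""
    return (rubrics_dir / f"{name}.md").read_text(encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    rubrics_dir = Path(tmp)
    (rubrics_dir / "my_rubric.md").write_text(
        "You are an impartial grader.\n", encoding="utf-8"
    )
    text = load_rubric_from(rubrics_dir, "my_rubric")

    # A name with no matching file surfaces as FileNotFoundError from read_text.
    try:
        load_rubric_from(rubrics_dir, "missing")
        missing_raises = False
    except FileNotFoundError:
        missing_raises = True

print(repr(text), missing_raises)
```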
 ## Samples
 - [`samples/bench-rag.py`](https://github.com/manavsehgal/ai-field-notes/blob/main/fieldkit/samples/bench-rag.py) — offline `Bench` + `Judge.parse` walkthrough.

{fieldkit-0.4.2 → fieldkit-0.4.3}/pyproject.toml RENAMED

@@ -59,6 +59,7 @@ packages = ["src/fieldkit"]
 include = [
   "src/fieldkit/**/*.py",
   "src/fieldkit/**/data/*.json",
+  "src/fieldkit/eval/rubrics/*.md",
 ]
 [tool.hatch.build.targets.sdist]

{fieldkit-0.4.2 → fieldkit-0.4.3}/src/fieldkit/_version.py RENAMED

@@ -6,4 +6,4 @@
 build time, so bumping it here is enough to bump the wheel version too.
 """
-__version__ = "0.4.2"
+__version__ = "0.4.3"

{fieldkit-0.4.2 → fieldkit-0.4.3}/src/fieldkit/eval/__init__.py RENAMED

@@ -45,6 +45,8 @@ __all__ = [
     "REFUSAL_PATTERNS",
     "RUBRIC_CORRECTNESS",
     "RUBRIC_FAITHFULNESS",
+    "RUBRIC_OFFICE_ACTION_ARGUMENT",
+    "RUBRIC_PATENT_CLAIM_VALIDITY",
     "RUBRIC_RELEVANCE",
     "AgentRun",
     "AssertionGrader",
@@ -60,6 +62,7 @@ __all__ = [
     "MatchedBaseComparisonResult",
     "PassAtK",
     "PassAtKResult",
+    "PriorArtRelevanceResult",
     "Trajectory",
     "TrajectoryIter",
     "TurnDetail",
@@ -67,9 +70,16 @@ __all__ = [
     "VerticalQA",
     "contains",
     "exact_match",
+    "irac_structure",
     "is_refusal",
+    "load_rubric",
+    "mcq_letter",
     "numeric_match",
+    "office_action_argument",
     "pass_at_k_estimator",
+    "patent_claim_validity",
+    "prior_art_relevance",
+    "prior_art_relevance_full",
     "summarize_agent_runs",
     "summarize_metric",
 ]
@@ -108,6 +118,415 @@ def is_refusal(text: str | None) -> bool:
     return any(p.search(text) for p in REFUSAL_PATTERNS)
+# --- MCQ letter scorer ---------------------------------------------------
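+# The optional "is" lets "answer is X" match at this step, as the docstring
+# below describes; without it the letters of "is" block the separator class.
+_MCQ_AFTER_ANSWER_RE = re.compile(
+    r"\b(?:answer|choice|option)\b(?:\s+is\b)?[^A-Za-z0-9]{0,20}([A-D])\b",
+    re.IGNORECASE,
+)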
+_MCQ_BOUNDED_RE = re.compile(r"\b([A-D])\b", re.IGNORECASE)
+_THINK_BLOCK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)
+def mcq_letter(
+    predicted: str,
+    expected: str,
+    *,
+    strip_think: bool = True,
+) -> float:
+    """Score MCQ letter responses, with `<think>`-aware extraction for
+    reasoning models.
+    Promoted to `fieldkit.eval` after three vertical-bench reuses
+    (cybermetric, medmcqa, patent-strategist). Decision order:
+    (a) stripped one-letter output ("B"); (b) "answer: X" / "answer is X" /
+    "option X" / "choice X" with X in [A-D]; (c) first word-bounded [A-D]
+    in the response. Case-insensitive throughout.
+    When `strip_think=True` (default), `<think>...</think>` blocks are
+    regex-stripped from `predicted` *before* the three-step decision —
+    keeps reasoning-trace verbosity from polluting the letter pick on
+    R1-distill family models. No-op on text without `<think>` tags, so
+    cyber/medical callers can leave the default on safely.
+    """
+    pred = (predicted or "")
+    if strip_think:
+        pred = _THINK_BLOCK_RE.sub("", pred)
+    pred = pred.strip()
+    exp = (expected or "").strip().upper()
+    if not pred or exp not in ("A", "B", "C", "D"):
+        return 0.0
+    stripped = pred.upper().strip(".,)!:- ")
+    if len(stripped) <= 1 and stripped in ("A", "B", "C", "D"):
+        return 1.0 if stripped == exp else 0.0
+    m = _MCQ_AFTER_ANSWER_RE.search(pred)
+    if m:
+        return 1.0 if m.group(1).upper() == exp else 0.0
+    m = _MCQ_BOUNDED_RE.search(pred)
+    if m:
+        return 1.0 if m.group(1).upper() == exp else 0.0
+    return 0.0
+# --- Patent-strategist specialty scorers ---------------------------------
+# Added in v0.4.3 alongside the `format='patent-strategist'` branch of
+# `VerticalBench`. Two are LLM-judge-backed (`patent_claim_validity`,
+# `office_action_argument` — they wrap a caller-supplied `Judge`), one is
+# deterministic regex-checklist (`irac_structure`), and one is a pure
+# Spearman-rank reducer over ranked prior-art lists (`prior_art_relevance`).
+# Per `specs/patent-strategist-v1.md` §3.3. Rubric markdown ships alongside
+# this module under `rubrics/` and loads via `load_rubric()`.
+def load_rubric(name: str) -> str:
+    """Return the text of a rubric markdown file shipped with `fieldkit.eval`.
+    `name` is the bare filename without extension — e.g. ``"patent_claim_validity"``
+    resolves to ``fieldkit/eval/rubrics/patent_claim_validity.md``. The file
+    is read once per call (small files, no caching needed for the wheel-bundled
+    surface).
+    """
+    rubrics_dir = Path(__file__).parent / "rubrics"
+    path = rubrics_dir / f"{name}.md"
+    return path.read_text(encoding="utf-8")
+RUBRIC_PATENT_CLAIM_VALIDITY: str = load_rubric("patent_claim_validity")
+"""System prompt for the PatentScore-methodology 7-dim claim-validity rubric.
+Pass to `Judge(client=..., rubric=RUBRIC_PATENT_CLAIM_VALIDITY)` and feed
+into `patent_claim_validity(predicted, expected, judge=<that judge>)`."""
+RUBRIC_OFFICE_ACTION_ARGUMENT: str = load_rubric("office_action_argument")
+"""System prompt for the 4-dim office-action-response rubric (rejection-type
+ID, statutory citation accuracy, argument structure, persuasiveness)."""
+def _format_rubric_hints(rubric: dict[str, Any] | None) -> str:
+    """Render a per-row rubric dict as a stable `Hints:` block.
+    Keys are sorted for determinism; list values get bullet-rendered; nested
+    objects fall through to `json.dumps`. Returns ``""`` on empty/None input.
+    """
+    if not rubric:
+        return ""
+    lines = ["Hints:"]
+    for k in sorted(rubric.keys()):
+        v = rubric[k]
+        if isinstance(v, (list, tuple)):
+            lines.append(f"- {k}:")
+            for item in v:
+                lines.append(f"  - {item}")
+        elif isinstance(v, dict):
+            lines.append(f"- {k}: {json.dumps(v, sort_keys=True)}")
+        else:
+            lines.append(f"- {k}: {v}")
+    return "\n".join(lines)
+def patent_claim_validity(
+    predicted: str,
+    expected: str,
+    *,
+    judge: Judge,
+    rubric: dict[str, Any] | None = None,
+) -> float:
+    """Score a predicted patent claim against a reference via the PatentScore
+    7-dim rubric (`RUBRIC_PATENT_CLAIM_VALIDITY`).
+    Caller supplies a `Judge` instance constructed with
+    `rubric=RUBRIC_PATENT_CLAIM_VALIDITY`; this function feeds the prediction
+    + reference (+ optional per-row `rubric` dict rendered as `Hints:`) into
+    `judge.grade()` and returns the parsed score, mapping ``None`` →
+    ``0.0`` so the bench's accuracy-averaging stays well-defined.
+    Per-row `rubric` keys are convention rather than enforcement — typical
+    examples include ``cited_prior_art``, ``claim_type``
+    (``independent`` / ``dependent``), ``dependency_target``,
+    ``statutory_focus`` (e.g. ``["102", "103"]``).
+    """
+    hints = _format_rubric_hints(rubric)
+    context: str | None = hints or None
+    result = judge.grade(
+        prediction=predicted,
+        reference=expected or None,
+        context=context,
+    )
+    return result.score if result.score is not None else 0.0
+def office_action_argument(
+    predicted: str,
+    expected: str,
+    *,
+    judge: Judge,
+    rubric: dict[str, Any] | None = None,
+) -> float:
+    """Score an attorney's predicted office-action response via the 4-dim
+    rubric (`RUBRIC_OFFICE_ACTION_ARGUMENT`).
+    Caller supplies a `Judge` constructed with that rubric; `predicted` is
+    the attorney response text, `expected` is the reference response (or
+    empty when the row's gold is a rubric dict only). Per-row `rubric`
+    convention keys: ``rejection_type`` (``102`` / ``103`` / ``112(a)`` /
+    ``112(b)`` / ``101`` / ``double-patenting`` / ``restriction``),
+    ``required_citations`` (list of expected MPEP/CFR/case cites),
+    ``claim_count``, ``relies_on_official_notice`` (bool).
+    """
+    hints = _format_rubric_hints(rubric)
+    context: str | None = hints or None
+    result = judge.grade(
+        prediction=predicted,
+        reference=expected or None,
+        context=context,
+    )
+    return result.score if result.score is not None else 0.0
+# IRAC structure detector — one regex per component. Each pattern matches if
+# the predicted response signals the component's presence via a section
+# heading, transition phrase, or canonical lead-in. Patterns are deliberately
+# tolerant (markdown / plain prose / numbered lists all hit) — false positives
+# are far less harmful than false negatives at the 0.25-granularity we report.
+_IRAC_PATTERNS: dict[str, re.Pattern[str]] = {
+    "issue": re.compile(
+        r"\b(?:issue\s*[:\s]|the\s+issue\b|question\s+presented\b|whether\b)",
+        re.IGNORECASE,
+    ),
+    "rule": re.compile(
+        r"(?:\brule\b|\bthe\s+rule\b|"
+        r"\bunder\s+(?:35\s+u\.?s\.?c\.?|the\s+(?:statute|law|holding|standard|mpep))|"
+        r"\bmpep\s+(?:§|sec(?:tion)?\.?)?\s*\d|"
+        r"\b35\s+u\.?s\.?c\.?\s*(?:§|sec(?:tion)?\.?)?\s*\d{2,3}|"
+        r"\blaw\s+(?:provides|states|requires)\b|"
+        r"\bholding\b)",
+        re.IGNORECASE,
+    ),
+    "application": re.compile(
+        r"(?:\bapplication\b|\bapplying\b|\bhere[,\s]|\bin\s+this\s+case\b|"
+        r"\bin\s+the\s+(?:present|instant)\b|"
+        r"\bthe\s+(?:present|instant)\s+(?:claim|application|invention)\b|"
+        r"\bas\s+applied\b|\bapplied\s+to\b)",
+        re.IGNORECASE,
+    ),
+    "conclusion": re.compile(
+        r"(?:\bconclusion\b|\bin\s+conclusion\b|\btherefore\b|\baccordingly\b|"
+        r"\bthus[,\s]|"
+        r"\bfor\s+(?:the\s+)?(?:above|foregoing)\s+reasons\b|"
+        r"\bin\s+sum\b|"
+        r"\bthe\s+(?:examiner|applicant)\s+(?:should|is\s+respectfully\s+requested)\b)",
+        re.IGNORECASE,
+    ),
+}
+def irac_structure(predicted: str, expected: str = "") -> float:
+    """Deterministic 4-checklist IRAC structure scorer.
+    Returns one of ``{0.0, 0.25, 0.5, 0.75, 1.0}`` — one quarter per IRAC
+    component (Issue / Rule / Application / Conclusion) detected via
+    regex on `predicted`. `expected` is ignored (the scorer measures
+    structural form, not factual agreement) but kept in the signature
+    for `VerticalBench` compatibility.
+    Detection patterns are deliberately tolerant: markdown headings,
+    plain-prose transitions, and canonical Patent-Bar lead-ins
+    ("Whether…", "Under 35 USC 103…", "Here…", "Therefore…") all count.
+    False positives are far less harmful than false negatives at this
+    granularity — the score's job is to flag *structural absence*, not
+    to grade rhetorical polish.
+    """
+    if not predicted:
+        return 0.0
+    hits = sum(1 for pat in _IRAC_PATTERNS.values() if pat.search(predicted))
+    return hits / 4.0
+@dataclass(frozen=True, slots=True)
+class PriorArtRelevanceResult:
+    """Full result from `prior_art_relevance_full` — both rank-correlation and
+    Likert-MSE figures. The bench-facing `prior_art_relevance` scorer
+    surfaces just `spearman_rho` per spec §3.3; callers who need both
+    metrics import this dataclass directly.
+    """
+    spearman_rho: float
+    mse_likert: float | None
+    n: int
+_RANKED_LIST_RE = re.compile(r"^[\s\-\*\d.\)]*([^\s].*?)\s*$")
+def _parse_ranked_list(s: str) -> list[str]:
+    """Tolerant parser: JSON array, comma-separated, or newline-separated.
+    Strips numeric / bullet / paren prefixes ("1.", "1)", "- ", "* ") so
+    a model can emit "1. doc-a\n2. doc-b" and the scorer recovers
+    ``["doc-a", "doc-b"]``. Empty/whitespace lines are dropped.
+    """
+    raw = (s or "").strip()
+    if not raw:
+        return []
+    if raw.startswith("["):
+        try:
+            arr = json.loads(raw)
+            if isinstance(arr, list):
+                return [str(x).strip() for x in arr if str(x).strip()]
+        except json.JSONDecodeError:
+            pass
+    sep = "\n" if "\n" in raw else ","
+    items: list[str] = []
+    for chunk in raw.split(sep):
+        chunk = chunk.strip()
+        if not chunk:
+            continue
+        m = _RANKED_LIST_RE.match(chunk)
+        if m:
+            items.append(m.group(1).strip())
+        else:
+            items.append(chunk)
+    return items
+def _spearman_rho(ranks_a: list[float], ranks_b: list[float]) -> float:
+    """Spearman rank correlation. Both inputs must be equal-length rank
+    vectors (already converted from raw values to ranks). Returns 0.0 for
+    n<2 (insufficient signal to correlate)."""
+    n = len(ranks_a)
+    if n < 2 or len(ranks_b) != n:
+        return 0.0
+    mean_a = sum(ranks_a) / n
+    mean_b = sum(ranks_b) / n
+    cov = sum((ranks_a[i] - mean_a) * (ranks_b[i] - mean_b) for i in range(n))
+    var_a = sum((r - mean_a) ** 2 for r in ranks_a)
+    var_b = sum((r - mean_b) ** 2 for r in ranks_b)
+    denom = math.sqrt(var_a * var_b)
+    if denom == 0.0:
+        return 0.0
+    return cov / denom
+def prior_art_relevance_full(
+    predicted: str | list[str],
+    expected: str | list[str],
+) -> PriorArtRelevanceResult:
+    """Compute Spearman ρ (and Likert MSE when both sides are numeric) between
+    a predicted ranked list of prior-art IDs and a gold ranked list.
+    Accepts list-of-str directly or a string (JSON / comma / newline). Items
+    in `predicted` that don't appear in `expected` are dropped; items in
+    `expected` that don't appear in `predicted` are assigned the worst
+    available rank (so omissions still penalize). Comparison is on the
+    overlap, padded for missing items.
+    `mse_likert` is computed only when both sides parse as numeric Likert
+    scores (1-5 relevance ratings rather than ID lists); otherwise it's
+    ``None``. The bench-facing `prior_art_relevance` surface uses
+    `spearman_rho` per spec §3.3.
+    """
+    pred_items = predicted if isinstance(predicted, list) else _parse_ranked_list(predicted)
+    gold_items = expected if isinstance(expected, list) else _parse_ranked_list(expected)
+    pred_items = [str(x) for x in pred_items]
+    gold_items = [str(x) for x in gold_items]
+    # Likert MSE branch — both sides parse as numbers and same length.
+    pred_nums = _try_parse_numeric_list(pred_items)
+    gold_nums = _try_parse_numeric_list(gold_items)
+    mse_likert: float | None
+    if pred_nums is not None and gold_nums is not None and len(pred_nums) == len(gold_nums) and pred_nums:
+        n = len(pred_nums)
+        mse_likert = sum((pred_nums[i] - gold_nums[i]) ** 2 for i in range(n)) / n
+        # When both sides are Likert vectors, Spearman runs on the numbers themselves.
+        rho = _spearman_rho(_rankify(pred_nums), _rankify(gold_nums))
+        return PriorArtRelevanceResult(spearman_rho=rho, mse_likert=mse_likert, n=n)
+    mse_likert = None
+    # ID-list branch — rank by position; missing-from-pred → worst available.
+    if not gold_items:
+        return PriorArtRelevanceResult(spearman_rho=0.0, mse_likert=None, n=0)
+    # Gold rank: 1 = best, N = worst.
+    gold_rank: dict[str, int] = {item: i + 1 for i, item in enumerate(gold_items)}
+    n_gold = len(gold_items)
+    # Predicted rank for each gold item: position in `pred_items`, or n_gold+1
+    # (worst-plus-one) if absent.
+    pred_rank: dict[str, int] = {}
+    seen: set[str] = set()
+    for i, item in enumerate(pred_items):
+        if item not in seen:
+            pred_rank[item] = i + 1
+            seen.add(item)
+    paired_gold: list[float] = []
+    paired_pred: list[float] = []
+    worst = n_gold + 1
+    for item in gold_items:
+        paired_gold.append(float(gold_rank[item]))
+        paired_pred.append(float(pred_rank.get(item, worst)))
+    # Re-rank both vectors before Spearman so positional gaps (from
+    # duplicates skipped in pred, or worst-rank padding) collapse to
+    # contiguous 1..N ranks. Without this, ["a","a","b","c"] vs ["a","b","c"]
+    # would produce pred-rank-vector [1,3,4] vs gold [1,2,3] and yield ρ≈0.98
+    # instead of the intuitive 1.0 (the pred order *is* monotonic in gold).
+    rho = _spearman_rho(_rankify(paired_pred), _rankify(paired_gold))
+    return PriorArtRelevanceResult(spearman_rho=rho, mse_likert=mse_likert, n=n_gold)
+def prior_art_relevance(
+    predicted: str | list[str],
+    expected: str | list[str],
+) -> float:
+    """Spearman ρ between predicted and gold prior-art rankings.
+    Bench-facing wrapper around `prior_art_relevance_full` — returns just
+    `spearman_rho` so it slots into `VerticalBench`'s
+    ``Callable[..., float]`` scorer contract. ρ ranges over ``[-1.0, 1.0]``;
+    bench averages it across questions per spec §3.3.
+    See `prior_art_relevance_full` for input parsing rules (JSON array,
+    comma-separated, newline-separated all accepted; list[str] accepted
+    directly) and the Likert-MSE second metric.
+    """
+    return prior_art_relevance_full(predicted, expected).spearman_rho
+def _try_parse_numeric_list(items: list[str]) -> list[float] | None:
+    """Return a list of floats if every item parses as a number, else None."""
+    if not items:
+        return None
+    out: list[float] = []
+    for x in items:
+        try:
+            out.append(float(x))
+        except (TypeError, ValueError):
+            return None
+    return out
+def _rankify(values: list[float]) -> list[float]:
+    """Convert a value vector to its average-rank vector (1-indexed).
+    Ties get the mean of the ranks they span — standard Spearman tie-handling.
+    """
+    n = len(values)
+    indexed = sorted(range(n), key=lambda i: values[i])
+    ranks = [0.0] * n
+    i = 0
+    while i < n:
+        j = i
+        while j + 1 < n and values[indexed[j + 1]] == values[indexed[i]]:
+            j += 1
+        avg_rank = (i + j) / 2.0 + 1.0
+        for k in range(i, j + 1):
+            ranks[indexed[k]] = avg_rank
+        i = j + 1
+    return ranks
 # --- Bench ---------------------------------------------------------------
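The tolerant list parse in the hunk above can be sketched as follows, with one deliberate difference labeled as an assumption: this variant strips a numeric or bullet prefix only when whitespace follows it, so items that themselves begin with digits are kept whole. `parse_ranked` and `PREFIX_RE` are hypothetical names, not fieldkit's.

```python
import json
import re

# Assumption: a prefix is "1.", "1)", "-" or "*" followed by whitespace.
PREFIX_RE = re.compile(r"^\s*(?:(?:\d+[.)]|[-*])\s+)?(\S.*?)\s*$")


def parse_ranked(s: str) -> list:
    """Parse a JSON array, comma-separated, or newline-separated ranked list."""
    raw = (s or "").strip()
    if not raw:
        return []
    if raw.startswith("["):
        try:
            arr = json.loads(raw)
            if isinstance(arr, list):
                return [str(x).strip() for x in arr if str(x).strip()]
        except json.JSONDecodeError:
            pass  # fall through to the separator-based parse
    sep = "\n" if "\n" in raw else ","
    items = []
    for chunk in raw.split(sep):
        m = PREFIX_RE.match(chunk)
        if m:  # blank chunks do not match and are skipped
            items.append(m.group(1))
    return items


print(parse_ranked('["a", "b", "c"]'))
print(parse_ranked("1. doc-a\n2) doc-b\n- doc-c\n* doc-d"))
print(parse_ranked("10987654, 4.5"))
```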

fieldkit-0.4.3/src/fieldkit/eval/rubrics/office_action_argument.md ADDED

@@ -0,0 +1,51 @@
+You are an impartial patent-prosecution grader scoring an attorney's
+*predicted* office-action response against a *reference* response (or a
+rubric specifying the expected rejection type + key citations) on 4
+dimensions, each 0-1. Return the arithmetic mean as `score`.
+**Dimensions**
+1. **Rejection-type identification** — does the response correctly identify
+   the rejection's statutory basis? Valid types include §101 (subject matter
+   eligibility), §102 (anticipation), §103 (obviousness), §112(a) (written
+   description / enablement), §112(b) (indefiniteness), double-patenting
+   (statutory or obviousness-type), and restriction requirements. 1.0 =
+   correct type AND statutory subsection cited · 0.5 = type correct,
+   subsection wrong or missing · 0.0 = wrong type.
+2. **Statutory citation accuracy** — are all cited statutes, CFR rules, and
+   MPEP sections accurate (correct section + subsection numbering, no
+   fabricated citations)? Penalize hallucinated MPEP sections (very common
+   failure mode for under-trained models). 1.0 = all cites accurate ·
+   0.5 = mostly correct with one minor error · 0.0 = fabricated or
+   substantially wrong cites.
+3. **Argument structure** — does the response follow the canonical
+   prosecution response shape: (a) restate the rejection, (b) traverse with
+   reasoning grounded in case law / MPEP, (c) propose amendment if needed
+   with support citation, (d) summary statement / request for allowance?
+   Penalize bare denials, amendments without §112(a) support pointers, and
+   missing claim-by-claim treatment when the rejection lists multiple claims.
+4. **Persuasiveness** — would an examiner read this and find it credible
+   enough to merit withdrawing the rejection (or, at minimum, issuing a Final
+   with substantive new analysis rather than restating)? Score the *technical
+   substance* of the argument, not its tone. Penalize attorney-argument-only
+   responses (without evidence/declarations) where the rejection rests on
+   official notice that requires rebuttal evidence.
+**Inputs**
+The user message provides the predicted response text, the reference text
+(if any), and optionally per-row rubric hints (e.g. expected rejection_type,
+required_citations list, claim_count) under a `Hints:` heading. Use the
+hints to anchor scoring; missing reference = score on facial quality and note
+the assumption.
+**Output**
+Return ONLY a JSON object:
+```json
+{"score": 0.65, "rationale": "Identifies §103 correctly and cites MPEP 2143 (KSR rationales). Weakest on argument-structure (no §112(a) support pointer for the proposed amendment) and persuasiveness (relies on attorney argument alone where rejection cites official notice — declaration would strengthen)."}
+```
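On the consuming side, a reply in the JSON shape requested above might be validated along these lines. `parse_reply` and `mean_of_dimensions` are hypothetical helpers written for illustration; fieldkit's own judge parsing may behave differently.

```python
import json
from typing import Optional, Sequence


def parse_reply(reply: str) -> Optional[float]:
    """Extract `score` from a judge reply; None if malformed or out of range."""
    try:
        obj = json.loads(reply)
    except json.JSONDecodeError:
        return None
    if not isinstance(obj, dict):
        return None
    score = obj.get("score")
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        return None
    return float(score) if 0.0 <= score <= 1.0 else None


def mean_of_dimensions(dims: Sequence[float]) -> float:
    """Arithmetic mean of the per-dimension scores, as the rubric requests."""
    return sum(dims) / len(dims)


ok = parse_reply('{"score": 0.65, "rationale": "Identifies 103 correctly."}')
bad_json = parse_reply("Score: 0.65")
out_of_range = parse_reply('{"score": 1.7}')
example_mean = mean_of_dimensions([1.0, 1.0, 0.5, 0.5])
print(ok, bad_json, out_of_range, example_mean)
```

A caller that maps `None` to `0.0`, as the scorers in this release do, would treat the two malformed replies as zero-score rows.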
