fieldkit 0.4.2__tar.gz → 0.5.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (68)
  1. {fieldkit-0.4.2 → fieldkit-0.5.0}/.gitignore +5 -1
  2. {fieldkit-0.4.2 → fieldkit-0.5.0}/CHANGELOG.md +225 -0
  3. {fieldkit-0.4.2 → fieldkit-0.5.0}/PKG-INFO +1 -1
  4. {fieldkit-0.4.2 → fieldkit-0.5.0}/docs/api/eval.md +63 -1
  5. {fieldkit-0.4.2 → fieldkit-0.5.0}/docs/api/publish.md +4 -0
  6. {fieldkit-0.4.2 → fieldkit-0.5.0}/docs/api/quant.md +16 -8
  7. {fieldkit-0.4.2 → fieldkit-0.5.0}/docs/api/rag.md +2 -2
  8. fieldkit-0.5.0/docs/api/training.md +605 -0
  9. {fieldkit-0.4.2 → fieldkit-0.5.0}/pyproject.toml +2 -0
  10. {fieldkit-0.4.2 → fieldkit-0.5.0}/src/fieldkit/_version.py +1 -1
  11. {fieldkit-0.4.2 → fieldkit-0.5.0}/src/fieldkit/eval/__init__.py +422 -0
  12. fieldkit-0.5.0/src/fieldkit/eval/rubrics/office_action_argument.md +51 -0
  13. fieldkit-0.5.0/src/fieldkit/eval/rubrics/patent_claim_validity.md +53 -0
  14. {fieldkit-0.4.2 → fieldkit-0.5.0}/src/fieldkit/eval/vertical.py +105 -3
  15. {fieldkit-0.4.2 → fieldkit-0.5.0}/src/fieldkit/training/__init__.py +99 -0
  16. fieldkit-0.5.0/src/fieldkit/training/convert.py +256 -0
  17. fieldkit-0.5.0/src/fieldkit/training/data/decide-entries/.gitkeep +0 -0
  18. fieldkit-0.5.0/src/fieldkit/training/data/decide-entries/2026-05-22-paired-bakeoff.yaml +53 -0
  19. fieldkit-0.5.0/src/fieldkit/training/decide.py +515 -0
  20. fieldkit-0.5.0/src/fieldkit/training/probe.py +904 -0
  21. fieldkit-0.5.0/src/fieldkit/training/recipe.py +429 -0
  22. fieldkit-0.5.0/src/fieldkit/training/run.py +894 -0
  23. fieldkit-0.5.0/tests/eval/__init__.py +0 -0
  24. fieldkit-0.5.0/tests/eval/test_irac_structure.py +169 -0
  25. fieldkit-0.5.0/tests/eval/test_judge_backed_scorers.py +233 -0
  26. fieldkit-0.5.0/tests/eval/test_mcq_letter.py +120 -0
  27. fieldkit-0.5.0/tests/eval/test_prior_art_relevance.py +137 -0
  28. fieldkit-0.5.0/tests/test_training_convert.py +279 -0
  29. fieldkit-0.5.0/tests/test_training_decide.py +625 -0
  30. fieldkit-0.5.0/tests/test_training_probe.py +797 -0
  31. fieldkit-0.5.0/tests/test_training_recipe.py +417 -0
  32. fieldkit-0.5.0/tests/test_training_run.py +759 -0
  33. {fieldkit-0.4.2 → fieldkit-0.5.0}/tests/test_vertical_bench.py +318 -0
  34. fieldkit-0.4.2/docs/api/training.md +0 -85
  35. {fieldkit-0.4.2 → fieldkit-0.5.0}/LICENSE +0 -0
  36. {fieldkit-0.4.2 → fieldkit-0.5.0}/README.md +0 -0
  37. {fieldkit-0.4.2 → fieldkit-0.5.0}/docs/api/capabilities.md +0 -0
  38. {fieldkit-0.4.2 → fieldkit-0.5.0}/docs/api/cli.md +0 -0
  39. {fieldkit-0.4.2 → fieldkit-0.5.0}/docs/api/lineage.md +0 -0
  40. {fieldkit-0.4.2 → fieldkit-0.5.0}/docs/api/nim.md +0 -0
  41. {fieldkit-0.4.2 → fieldkit-0.5.0}/samples/bench-rag.py +0 -0
  42. {fieldkit-0.4.2 → fieldkit-0.5.0}/samples/feasibility-math.py +0 -0
  43. {fieldkit-0.4.2 → fieldkit-0.5.0}/samples/hello-lineage.py +0 -0
  44. {fieldkit-0.4.2 → fieldkit-0.5.0}/samples/hello-nim.py +0 -0
  45. {fieldkit-0.4.2 → fieldkit-0.5.0}/samples/naive-rag.py +0 -0
  46. {fieldkit-0.4.2 → fieldkit-0.5.0}/src/fieldkit/__init__.py +0 -0
  47. {fieldkit-0.4.2 → fieldkit-0.5.0}/src/fieldkit/capabilities/__init__.py +0 -0
  48. {fieldkit-0.4.2 → fieldkit-0.5.0}/src/fieldkit/capabilities/data/__init__.py +0 -0
  49. {fieldkit-0.4.2 → fieldkit-0.5.0}/src/fieldkit/capabilities/data/spark-capabilities.json +0 -0
  50. {fieldkit-0.4.2 → fieldkit-0.5.0}/src/fieldkit/cli/__init__.py +0 -0
  51. {fieldkit-0.4.2 → fieldkit-0.5.0}/src/fieldkit/lineage/__init__.py +0 -0
  52. {fieldkit-0.4.2 → fieldkit-0.5.0}/src/fieldkit/nim/__init__.py +0 -0
  53. {fieldkit-0.4.2 → fieldkit-0.5.0}/src/fieldkit/publish/__init__.py +0 -0
  54. {fieldkit-0.4.2 → fieldkit-0.5.0}/src/fieldkit/quant/__init__.py +0 -0
  55. {fieldkit-0.4.2 → fieldkit-0.5.0}/src/fieldkit/rag/__init__.py +0 -0
  56. {fieldkit-0.4.2 → fieldkit-0.5.0}/tests/__init__.py +0 -0
  57. {fieldkit-0.4.2 → fieldkit-0.5.0}/tests/conftest.py +0 -0
  58. {fieldkit-0.4.2 → fieldkit-0.5.0}/tests/test_capabilities.py +0 -0
  59. {fieldkit-0.4.2 → fieldkit-0.5.0}/tests/test_cli.py +0 -0
  60. {fieldkit-0.4.2 → fieldkit-0.5.0}/tests/test_eval.py +0 -0
  61. {fieldkit-0.4.2 → fieldkit-0.5.0}/tests/test_lineage.py +0 -0
  62. {fieldkit-0.4.2 → fieldkit-0.5.0}/tests/test_nim.py +0 -0
  63. {fieldkit-0.4.2 → fieldkit-0.5.0}/tests/test_nim_spark.py +0 -0
  64. {fieldkit-0.4.2 → fieldkit-0.5.0}/tests/test_publish.py +0 -0
  65. {fieldkit-0.4.2 → fieldkit-0.5.0}/tests/test_quant.py +0 -0
  66. {fieldkit-0.4.2 → fieldkit-0.5.0}/tests/test_rag.py +0 -0
  67. {fieldkit-0.4.2 → fieldkit-0.5.0}/tests/test_rag_spark.py +0 -0
  68. {fieldkit-0.4.2 → fieldkit-0.5.0}/tests/test_training.py +0 -0
@@ -25,7 +25,8 @@ pnpm-debug.log*
25
25
  # local-only working material (not for the public blog)
26
26
  ideas/
27
27
  HANDOFF.md
28
- .claude/
28
+ .claude/*
29
+ !.claude/skills/
29
30
 
30
31
  # transient vibe-test artifacts (Playwright screenshots written to repo root)
31
32
  .playwright-mcp/
@@ -36,6 +37,9 @@ articles/*/evidence/runs/
36
37
 
37
38
  __pycache__/
38
39
 
40
+ # Unsloth + triton JIT scratch — root-owned, written by the in-container trainer
41
+ unsloth_compiled_cache/
42
+
39
43
  # fieldkit Python build / venv detritus (keep alongside the package, not committed)
40
44
  fieldkit/build/
41
45
  fieldkit/dist/
@@ -6,6 +6,231 @@ The format follows [Keep a Changelog](https://keepachangelog.com/en/1.1.0/), and
6
6
 
7
7
  ## [Unreleased]
8
8
 
9
+ ## [0.5.0] — 2026-05-22
10
+
11
+ The `fieldkit.training` v0.5 build-out — five new modules (`recipe`, `convert`, `run`, `probe`, `decide`) that lift the patent-strategist v3 paired bakeoff (NeMo Framework vs Unsloth, session 2026-05-21 → 2026-05-22) out of one-shot scripts and into a reusable, symmetric library surface. Drives Article H end-to-end. `fieldkit.training.__all__` grows from 7 → 46 entries; +203 new tests; package suite goes 507 → 710 passed.
12
+
13
+ The release notes below run newest-phase-first (E → A) to match the build order. Cross-phase totals, live-Spark verification, and the article window are summarized at the bottom.
14
+
15
+ ### Added — `fieldkit.training` v0.5 build-out (Phase E: `decide.train_backend` + `refresh` flywheel)
16
+
17
+ Final module of the v0.5 `fieldkit.training` build-out, after Phase A's `recipe.TrainRecipe` (`bee458d`), Phase B's `convert` (`9f2a59f`), Phase C's `run` + `merge_and_export` + `standardize_hf_export` (`2e142e7`), and Phase D's `probe.ReasoningProbe` + `ProbeReport.compare` (`78a3131`). A YAML-lookup decision API with a lifecycle filter and a refresh flywheel: the contract that lets every future `articles/*-bakeoff-*` write a decide-entry alongside its prose, so the next session's `train_backend(...)` returns the article's findings programmatically.
18
+
19
+ - **`fieldkit.training.train_backend(*, base_model_family, optimize_for, dirs=None)`** — walks the configured entry directories (default = bundled `SEED_ENTRIES_DIR` + `USER_ENTRIES_DIR`), filters to `lifecycle="active"` entries with `question="train_backend"`, sorts newest-first by `created`, and returns a `DecidePick` with the first finding whose `optimize_for` matches the argument from an entry whose `context.base_model_family` matches. `DecidePick.backend` is an alias for `.pick` to match the v0.5 spec example. Raises `DecideError` with a clear message (lists every active entry's slug + created date for the no-context-match case; lists available `optimize_for` keys for the partial-match case) when no entry covers the cell.
20
+ - **`fieldkit.training.load_entries(*, dirs=None, lifecycle="active", question=None)`** — directory scanner. `lifecycle` accepts a single value, a sequence, or `None`. `question` filters when set. Returns entries sorted by `created` descending. Missing directories are silently skipped; non-YAML/JSON files are ignored.
21
+ - **`fieldkit.training.refresh(*, dirs=None, freshness_days=180, today=None, include_lifecycle=None)`** — the refresh flywheel. Walks every entry (any lifecycle by default — audit signal matters across the full corpus, not just active) and flags any older than `freshness_days`. Returns a list of `StalenessReport` sorted oldest-first.
22
+ - **`fieldkit.training.DecideEntry`** — frozen dataclass for a parsed YAML entry. Constructor enforces `lifecycle in VALID_LIFECYCLES` and at least one finding. Methods: `find(optimize_for=...)`, `matches_context(**constraints)`, `age_days(today=None)`. Classmethods: `from_dict(data, *, path=None)`, `from_yaml(path)`. Pyyaml-optional — falls back to `json.loads` for JSON-shaped entries.
23
+ - **`fieldkit.training.DecideFinding`** — frozen dataclass for one row of an entry's `findings` list. `extra` field preserves forward-compatibility keys.
24
+ - **`fieldkit.training.DecidePick`** — frozen dataclass returned by `train_backend`. Carries `pick` / `backend` (alias) / `evidence` / `entry` (the matched `DecideEntry`) / `optimize_for` / `context` / `entry_path` (alias).
25
+ - **`fieldkit.training.StalenessReport`** — frozen dataclass returned by `refresh`. `entry` / `age_days` / `stale`.
26
+ - **`fieldkit.training.DecideError`** — distinct exception class for decide-layer failures.
27
+ - **`SEED_ENTRIES_DIR` / `USER_ENTRIES_DIR`** — Path constants. Seed dir is `fieldkit/src/fieldkit/training/data/decide-entries/` (bundled in the wheel via `pyproject.toml` package-data include — `src/fieldkit/**/data/**/*.yaml` added this release). User dir is `~/.fieldkit/decide-entries/` (read-after-write, gitignored, created by the caller on first write).
28
+ - **`VALID_LIFECYCLES`** — frozenset of valid lifecycle values: `"active"` (currently authoritative — `train_backend` returns these), `"superseded"` (replaced by newer entry; preserved for audit), `"deprecated"` (explicitly retired; preserved for audit but never returned from lookups).
29
+ - **`DEFAULT_FRESHNESS_DAYS`** — `180`. Six months matches typical hardware / framework / base-model drift cadence.
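The lookup order described for `train_backend` (active entries only, newest first, context match, then `optimize_for` match) can be sketched over plain dicts. This is a minimal illustration, not the library's implementation: `pick_backend`, `ENTRIES`, the use of `LookupError`, and the second (superseded) entry are hypothetical names and data invented for the example; only the first entry's fields come from the schema example in the changelog.

```python
from datetime import date

# Hypothetical in-memory entries shaped like the decide-entry YAML schema.
# The second entry is invented purely to show the lifecycle filter.
ENTRIES = [
    {
        "slug": "2026-05-22-paired-bakeoff",
        "lifecycle": "active",
        "created": date(2026, 5, 22),
        "question": "train_backend",
        "context": {"base_model_family": "qwen3-r1-distill"},
        "findings": [
            {"optimize_for": "patent_chain_length", "pick": "nemo",
             "evidence": "+44% mean chain ..."},
        ],
    },
    {
        "slug": "illustrative-superseded-entry",
        "lifecycle": "superseded",
        "created": date(2026, 6, 1),
        "question": "train_backend",
        "context": {"base_model_family": "qwen3-r1-distill"},
        "findings": [
            {"optimize_for": "patent_chain_length", "pick": "unsloth",
             "evidence": "made up for this sketch"},
        ],
    },
]


def pick_backend(entries, *, base_model_family, optimize_for):
    """Return (pick, slug) from the newest active entry that covers the cell."""
    active = [
        e for e in entries
        if e["lifecycle"] == "active" and e["question"] == "train_backend"
    ]
    active.sort(key=lambda e: e["created"], reverse=True)  # newest first
    matched = [
        e for e in active
        if e["context"].get("base_model_family") == base_model_family
    ]
    if not matched:
        known = ", ".join(f"{e['slug']} ({e['created']})" for e in active)
        raise LookupError(f"no active entry for {base_model_family!r}; have: {known}")
    for entry in matched:
        for finding in entry["findings"]:
            if finding["optimize_for"] == optimize_for:
                return finding["pick"], entry["slug"]
    keys = sorted({f["optimize_for"] for e in matched for f in e["findings"]})
    raise LookupError(f"no finding for {optimize_for!r}; available: {keys}")


result = pick_backend(
    ENTRIES, base_model_family="qwen3-r1-distill", optimize_for="patent_chain_length"
)
```

The superseded entry is newer than the active one, yet it never wins because the lifecycle filter runs before the sort.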
30
+
31
+ ### YAML schema
32
+
33
+ ```yaml
34
+ slug: 2026-05-22-paired-bakeoff # required, unique within dir
35
+ lifecycle: active # active | superseded | deprecated
36
+ created: 2026-05-22 # ISO date (YYYY-MM-DD)
37
+ question: train_backend # the decide.<name>() entry point
38
+ context: # required mapping
39
+ base_model_family: qwen3-r1-distill
40
+ findings: # required, non-empty
41
+ - optimize_for: patent_chain_length
42
+ pick: nemo
43
+ evidence: "+44% mean chain ..."
44
+ sources: [] # optional, default []
45
+ supersedes: [] # optional, default []
46
+ notes: "free-form annotation" # optional
47
+ ```
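A rough sketch of how an entry in this shape might be validated and aged, assuming the JSON-shaped fallback mentioned for `DecideEntry.from_yaml`. The names `check_entry`, `age_days`, and `REQUIRED_KEYS` are hypothetical, the set of keys treated as required is an assumption read off the schema comments, and clamping negative ages to zero is a guess at what the changelog calls the "negative clamp".

```python
import json
from datetime import date

VALID_LIFECYCLES = frozenset({"active", "superseded", "deprecated"})
# Assumption: every key not marked optional in the schema is required.
REQUIRED_KEYS = ("slug", "lifecycle", "created", "question", "context", "findings")


def check_entry(data):
    """Validate a parsed entry mapping and normalise `created` to a date."""
    missing = [key for key in REQUIRED_KEYS if key not in data]
    if missing:
        raise ValueError(f"missing required keys: {missing}")
    if data["lifecycle"] not in VALID_LIFECYCLES:
        raise ValueError(f"bad lifecycle: {data['lifecycle']!r}")
    if not isinstance(data["context"], dict):
        raise ValueError("context must be a mapping")
    findings = data["findings"]
    if not isinstance(findings, list) or not findings:
        raise ValueError("findings must be a non-empty list")
    for finding in findings:
        if not isinstance(finding, dict) or "optimize_for" not in finding or "pick" not in finding:
            raise ValueError(f"malformed finding: {finding!r}")
    created = data["created"]
    if not isinstance(created, date):  # YAML loaders may already hand back a date
        created = date.fromisoformat(str(created))
    return {**data, "created": created}


def age_days(entry, today):
    """Whole days since `created`, clamped at zero for future-dated entries."""
    return max(0, (today - entry["created"]).days)


raw = json.loads("""
{
  "slug": "2026-05-22-paired-bakeoff",
  "lifecycle": "active",
  "created": "2026-05-22",
  "question": "train_backend",
  "context": {"base_model_family": "qwen3-r1-distill"},
  "findings": [
    {"optimize_for": "patent_chain_length", "pick": "nemo", "evidence": "+44% mean chain ..."}
  ]
}
""")
entry = check_entry(raw)
age = age_days(entry, date(2026, 12, 1))
stale = age > 180  # 180 is the documented DEFAULT_FRESHNESS_DAYS
```

With `today` pinned to 2026-12-01 the entry is 193 days old, so a 180-day freshness window would flag it.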
48
+
49
+ `SEED_ENTRIES_DIR` ships empty at Phase E landing — the first seed entry (the patent-strategist v3 paired bakeoff) ships alongside Article H so the prose and the YAML stay co-located in the commit log. The wheel-include glob picks it up automatically once the YAML lands.
50
+
51
+ ### Test suite (Phase E)
52
+
53
+ **+53 new tests** in `tests/test_training_decide.py`:
54
+
55
+ - Module constants: `DEFAULT_FRESHNESS_DAYS`, `VALID_LIFECYCLES`, `SEED_ENTRIES_DIR` path shape, `USER_ENTRIES_DIR` path shape.
56
+ - `DecideFinding` — frozen enforcement, extra-keys preservation through `DecideEntry.from_dict`.
57
+ - `DecideEntry.from_dict` — minimal shape, path recording, optional fields carried, native `date` object accepted (pyyaml emits these), missing-required-key + bad-lifecycle + empty-findings + non-mapping-context + finding-missing-pick + bad-iso-date + findings-as-string rejection.
58
+ - `DecideEntry` methods — `find` happy + None, `matches_context` all/partial/empty-constraints, `age_days` with override + negative clamp.
59
+ - `DecideEntry.from_yaml` — JSON-form load (so the suite passes without pyyaml), missing-file + non-mapping rejection.
60
+ - `load_entries` — explicit dir, sorted newest-first, default-active filter, `lifecycle=None` returns all, sequence-of-lifecycles, bad-lifecycle rejection (string + sequence), question filter, non-entry-suffix skip, multi-dir merge, missing-dir tolerance.
61
+ - `train_backend` — happy path with two optimize_for values, newer-entry-wins-on-equal-context, non-active entries skipped, no-context-match error message, no-optimize_for error message, only-train_backend-question, context-copy on result.
62
+ - `refresh` — staleness flagging with `today=` override, oldest-first sort, default-all-lifecycles, `include_lifecycle="active"` filter, negative `freshness_days` rejection, empty corpus, `StalenessReport.age_days` round-trip.
63
+
64
+ All pure-python; YAML / JSON fixtures written to `tmp_path` so no bundled-seed-dir or user-dir filesystem is ever touched. The seed dir + user dir resolution is asserted by Path-constant comparison only.
65
+
66
+ Total suite: **710 passed, 2 skipped** offline (`pytest -q`, `/tmp/fk` venv) — up from 657 at Phase D landing. The 2 skips are the long-standing `--spark`-gated live-NIM / pgvector tests.
67
+
68
+ ### Build / packaging
69
+
70
+ - `pyproject.toml` `[tool.hatch.build.targets.wheel].include` extended to pick up `src/fieldkit/**/data/**/*.yaml`. Seed YAMLs dropped into the package data dir are now wheel-resident.
71
+
72
+ ### Added — `fieldkit.training` v0.5 build-out (Phase D: `probe.ReasoningProbe` + `ProbeReport.compare(normalize_budget=True)`)
73
+
74
+ Fourth module of the v0.5 `fieldkit.training` build-out, after Phase A's `recipe.TrainRecipe` (commit `bee458d`), Phase B's `convert` (commit `9f2a59f`), and Phase C's `run` + `merge_and_export` + `standardize_hf_export` (commit `2e142e7`). Lifts `scripts/probe_reasoning.py` + `scripts/compare_probes.py` into a reusable library surface, with the budget-normalization knob the NeMo-vs-Unsloth bakeoff (session 2026-05-21) discovered the hard way: lanes run at different `max_new_tokens` and a naive overall-aggregate compare gives the higher-budget lane an unearned chain-length advantage.
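To make the budget problem concrete, here is a small sketch of the exclusion rule using per-question chain lengths only. It is an illustration under stated assumptions rather than `ProbeReport.compare` itself: `length_ratio` is a hypothetical helper, the token counts are invented, and the real API works on full report objects.

```python
from statistics import mean


def length_ratio(baseline, current, *, baseline_cap, current_cap, normalize_budget=True):
    """Ratio of mean chain length (current / baseline) over shared qids.

    When the two lanes ran at different caps, any qid whose chain exceeds the
    smaller cap in EITHER lane is dropped from BOTH before the means are taken.
    """
    shared = sorted(baseline.keys() & current.keys())
    excluded = []
    if normalize_budget and baseline_cap != current_cap:
        cap = min(baseline_cap, current_cap)
        excluded = [q for q in shared if baseline[q] > cap or current[q] > cap]
    kept = [q for q in shared if q not in excluded]
    ratio = mean(current[q] for q in kept) / mean(baseline[q] for q in kept)
    return ratio, excluded


# Invented chain lengths: the baseline lane ran at 2048 tokens, the current lane at 4096.
baseline = {"q1": 500, "q2": 800, "q3": 1900}
current = {"q1": 600, "q2": 700, "q3": 3500}

naive_ratio, _ = length_ratio(
    baseline, current, baseline_cap=2048, current_cap=4096, normalize_budget=False
)
fair_ratio, dropped = length_ratio(
    baseline, current, baseline_cap=2048, current_cap=4096
)
```

In this toy data the naive comparison credits the higher-budget lane with a chain-length gain that comes entirely from `q3`, a question the lower-budget lane could never have answered at that length. The sketch does not handle an empty kept set or a zero baseline mean.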
75
+
76
+ - **`fieldkit.training.ReasoningProbe`** — orchestrator. Construct from a sequence of `ProbeQuestion`, or load JSONL via `ReasoningProbe.from_jsonl(path)`. `run(model_id, *, lora_path, step, max_new_tokens, temperature, generator, on_progress)` returns a `ProbeReport`. Default `generator` lazy-imports `torch` + `transformers` (+ `peft` when `lora_path` is set) and loads bf16 on `cuda:0` with `attn_implementation="sdpa"` — the same shape as `scripts/probe_reasoning.py`. Pass a `generator=fn(ProbeQuestion) -> str` callable to bypass the load entirely (test seam + the legitimate prod knob for callers with a pre-loaded model).
77
+ - **`fieldkit.training.ProbeReport`** — bag of `ProbeRow` plus run-metadata, with `overall` / `by_category` aggregates as properties. `with_budget(cap)` returns a new report excluding any row whose `<think>` chain exceeds the cap (rows with `has_think=False` preserved; new `max_new_tokens` is `min(self, cap)`; dropped qids appended to `excluded_qids`). `to_json(path)` / `ProbeReport.from_json(path)` round-trip the canonical JSON shape — matches what `scripts/probe_reasoning.py` already writes so existing artifacts (`probes/baseline.json`, `probes/patent-strategist-v3-*.json`) load directly. Tolerant of the legacy `think_quality_score` key on load — LLM-judge scoring is owned by an in-CC-session orchestrator skill per `[[feedback_llm_skill_pattern]]`.
78
+ - **`ProbeReport.compare(other, *, normalize_budget=True, thresholds=None, baseline_label=None, current_label=None)`** — runs the spec §4 Layer 5 pass/fail check (`think_presence_rate` ≥ 90%, `think_token_length` ≥ 75%). With `normalize_budget=True` (default), if the two reports ran at different `max_new_tokens` any qid whose chain exceeds the smaller cap in EITHER report is excluded from BOTH before per-metric ratios are recomputed — the bakeoff's exact apples-to-apples fix. Excluded qids surface on `CompareResult.excluded_qids` for footnoting. `thresholds` accepts a custom `CompareThresholds`; `baseline_label` / `current_label` override the auto-derived model-id labels (use `"unsloth"` / `"nemo"` for lane bakeoffs).
79
+ - **`fieldkit.training.parse_think(response)`** — pure helper that picks the longest `<think>...</think>` pair from a response. R1-distill models occasionally false-start with an empty `<think></think>` before the real chain; the non-greedy regex alone would match the empty pair first (caught on smoke-step-200 row 14 of the patent-strategist v1 lineage). Char-quarter token approximation.
80
+ - **`fieldkit.training.summarize_rows(rows)`** — pure-python aggregator. `think_presence_rate` over all rows; `think_token_length` over `has_think=True` rows only (matches the standalone runner's `summarize()`). Re-runnable after any filter for subset summaries.
81
+ - **`ProbeQuestion` / `ProbeRow` / `ProbeSummary`** — frozen dataclasses. `ProbeQuestion` keeps `source` / `license` / arbitrary `metadata` pass-throughs from the probe-set JSONL so provenance survives the round-trip. `ProbeRow` is the per-question result (`qid`, `category`, `response`, `has_think`, `think_n_tok`, `think_text`, `wall_seconds`). `ProbeSummary` is what `summarize_rows` + `ProbeReport.overall` return.
82
+ - **`CompareThresholds` / `CompareRow` / `CompareResult`** — frozen dataclasses for the compare surface. `DEFAULT_COMPARE_THRESHOLDS` is the module-level singleton (presence 0.90, length 0.75 — the spec §4 Layer 5 defaults).
83
+ - **`ProbeError`** — distinct exception class so callers selectively catch probe-layer failures.
84
+ - **`THINK_REGEX`** — the compiled `<think>(.*?)</think>` pattern, exposed for callers that re-parse cached responses (e.g. the LLM-judge sidecar described in `[[feedback_llm_skill_pattern]]`).
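The longest-pair rule behind `parse_think` can be sketched in a few lines. The pattern and the DOTALL flag follow the changelog; the function name `longest_think`, its `(text, approx_tokens)` return shape, and the floor division used for the char-quarter approximation are assumptions made for illustration.

```python
import re

# Same pattern the changelog names; DOTALL lets a chain span multiple lines.
THINK_REGEX = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def longest_think(response):
    """Return (think_text, approx_tokens) for the longest <think> block.

    Returns (None, 0) when the response has no <think>...</think> pair.
    """
    blocks = THINK_REGEX.findall(response)
    if not blocks:
        return None, 0
    text = max(blocks, key=len)
    return text, len(text) // 4  # roughly four characters per token


# A false start: an empty pair followed by the real chain.
sample = "<think></think>\n<think>step one\nstep two</think>final answer"
first_match = THINK_REGEX.search(sample).group(1)
text, n_tok = longest_think(sample)
```

`first_match` is the empty string, which is exactly the false-start failure mode described above; taking the longest match recovers the real chain.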
85
+
86
+ ### Test suite (Phase D)
87
+
88
+ **+56 new tests** in `tests/test_training_probe.py`:
89
+
90
+ - `parse_think` — no block, single block, empty (false-start) block, longest-of-multiple pickup, multiline DOTALL handling, `THINK_REGEX` export.
91
+ - `summarize_rows` — empty input zeros, all-think mean math, mixed-presence math, no-present-rows zero length.
92
+ - `ProbeQuestion` / `ProbeRow` — frozen dataclass enforcement, metadata default.
93
+ - `ProbeReport` — `max_new_tokens<=0` rejection, `overall` + `by_category` math, repr.
94
+ - `ProbeReport.with_budget` — over-cap exclusion, has_think=False rows preserved, no-op below cap, lower-budget preserved on cap > self, `cap<=0` rejection, excluded-qids composition across calls.
95
+ - `ProbeReport.compare` — same-budget pass + default labels, custom label overrides, presence-drop FAIL, custom thresholds enable pass, normalize-budget exclusion (the bakeoff case), normalize is no-op on same budget, `normalize_budget=False` direct compare, skip on zero baseline, per-category breakdown captured, `DEFAULT_COMPARE_THRESHOLDS` value lock.
96
+ - `ProbeReport.to_json` / `from_json` — canonical-shape dict, round-trip through disk, missing-file / bad-JSON / missing-key error paths, legacy `think_quality_score` key tolerated on load (recomputed from rows).
97
+ - `ReasoningProbe.from_jsonl` — required-key load, optional pass-throughs collected into metadata, missing-file / malformed-line / missing-key / empty-file rejection, blank lines skipped.
98
+ - `ReasoningProbe.run` — fake-generator path, `lora_path` / `step` / `max_new_tokens` / `temperature` round-trip, `on_progress` callback per-question, `max_new_tokens<=0` rejection, `wall_seconds` recorded, no-think response handled, empty question-list rejection, `__len__`.
99
+
100
+ All pure-python; no torch / transformers / peft / live model needed. The real generator path is exercised by hand in production (the existing `scripts/probe_reasoning.py` already validates that surface). Total suite: **657 passed, 2 skipped** offline (`pytest -q`, `/tmp/fk` venv) — up from 601 at Phase C landing. The 2 skips are the long-standing `--spark`-gated live-NIM / pgvector tests.
101
+
102
+ ### Added — `fieldkit.training` v0.5 build-out (Phase C: `run` + `merge_and_export` + `standardize_hf_export`)
103
+
104
+ Marquee module of the v0.5 `fieldkit.training` build-out, after Phase A's `recipe.TrainRecipe` (commit `bee458d`) and Phase B's `convert` (commit `9f2a59f`). Symmetric LoRA SFT driver across the NeMo Framework and Unsloth backends, with poll-disk liveness baked in. The BF16-clean export transformation that the patent-strategist v3 NeMo lane discovered the hard way (session 2026-05-21) is lifted out of one-shot bash and into the library so the next lane doesn't repeat the discovery.
105
+
106
+ - **`fieldkit.training.run(recipe, *, mode, poll_interval, on_progress, runner, sleep)`** — recipe → backend command → subprocess → poll-disk liveness → `TrainResult`. Builds the backend-specific `docker exec` command from a `TrainRecipe` (NeMo: `scripts/p65_train_nemo_lora.py` with the same flag set the bash orchestrator uses; Unsloth: `recipe.extra_env['TRAIN_SCRIPT']` with recipe fields passed as env vars). Polls `<run_dir>/latest_checkpointed_iteration.txt` + `iter_NNNNNNN/` directories — the *only* reliable progress signal under docker-exec + shell-redirect, where `train.log` can lag the process by 4+ hours per `[[feedback_megatron_train_log_buffering]]`. Defaults to a synchronous `subprocess.run` runner; injectable for tests and for async (nohup-style) launchers. Run-dir layout owned here: `<output_dir>/runs-smoke/` for smoke, `<output_dir>/runs-full/` for full.
107
+ - **`fieldkit.training.merge_and_export(recipe, *, iter, expect_iter, standardize, tokenizer_class_remap, runner)`** — merge a LoRA adapter into base + export to HF BF16 + bake in the BF16-clean transformation. **NeMo:** invokes Megatron-Bridge's `merge_lora.py` + `convert_checkpoints.py export` and stages the merged checkpoint to `<output_dir>/merged-mcore/` + the HF export to `<output_dir>/merged-hf-bf16/`. Mirrors `scripts/p65_merge_and_probe.sh` stages 1/2. **Unsloth:** invokes the caller-supplied `recipe.extra_env['MERGE_SCRIPT']` with `BASE_MODEL` / `LORA_CKPT` / `MERGED_HF` env vars. Then always (unless `standardize=False`) runs `standardize_hf_export` so the output is consumer-ready for `huggingface_hub.upload_large_folder`, `convert_hf_to_gguf.py`, and `fieldkit.publish.publish_quant`. Resolves the LoRA iter from `latest_checkpointed_iteration.txt` by default; explicit `iter=` overrides; `expect_iter=` catches early-stopped runs the same way `p65_merge_and_probe.sh` did.
108
+ - **`fieldkit.training.standardize_hf_export(hf_dir, *, tokenizer_class_remap)`** — pure-python helper that bakes in the two known NeMo-export quirks: (1) shard names like `model-NNNNN-of-000002.safetensors` get renamed to the HF-standard `model-NNNNN-of-00002.safetensors` width (`max(5, len(str(total_shards)))` digits) with matching `model.safetensors.index.json` rewrite — per `[[feedback_nemo_export_shard_numbering]]`; and (2) `tokenizer_config.json`'s `tokenizer_class` field is rewritten via lookup table (default `DEEPSEEK_TOKENIZER_CLASS_REMAP`: `TokenizersBackend` → `LlamaTokenizer`) — per `[[feedback_nemo_export_tokenizer_class_quirk]]`. Idempotent; tolerant of missing index and missing tokenizer config; raises `MergeExportError` only on malformed inputs (rename collision, non-JSON index, etc.). Pass `tokenizer_class_remap={}` to disable the tokenizer fix.
109
+ - **`fieldkit.training.poll_run_progress(run_dir)`** — pure-python helper that reads `latest_checkpointed_iteration.txt` + scans for `iter_NNNNNNN/` directories. Returns `(latest_iter, sorted_iter_dirs)`. Used internally by `run()`; surfaced as a public function so callers can build their own progress monitors. `(0, [])` on a non-existent run dir is the documented quiescent state.
110
+ - **`fieldkit.training.DEEPSEEK_TOKENIZER_CLASS_REMAP`** — the default `tokenizer_class` remap dict, exposed for inspection / extension. Currently `{"TokenizersBackend": "LlamaTokenizer"}`. Other model families that surface a similar export quirk can extend the table by passing a merged dict to `standardize_hf_export`.
111
+ - **`TrainResult` / `MergeExportResult`** — frozen dataclasses returned from `run()` / `merge_and_export()`. `TrainResult` carries `(backend, mode, run_dir, final_iter, wall_seconds, container, log_path, iter_dirs)`. `MergeExportResult` carries `(backend, source_iter, merged_hf_dir, merged_mcore_dir, tokenizer_class_remapped, shard_renames, standardize_applied)`. Both are hashable; safe to drop into a lineage row.
112
+ - **`TrainError` / `MergeExportError`** — distinct exception classes so callers can selectively catch launch-time + runtime training failures vs merge / export / standardize failures. Both are `RuntimeError` subclasses.
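The two export fixes that `standardize_hf_export` bakes in can be sketched as pure functions on names and dicts. This is a hedged illustration, not the library code: `standard_shard_name`, `standardize_index`, and `remap_tokenizer_class` are hypothetical helpers, and the sketch skips the on-disk rename and the collision checks the changelog mentions.

```python
import re

SHARD_RE = re.compile(r"^model-(\d+)-of-(\d+)\.safetensors$")
# Default remap as documented for DEEPSEEK_TOKENIZER_CLASS_REMAP.
TOKENIZER_CLASS_REMAP = {"TokenizersBackend": "LlamaTokenizer"}


def standard_shard_name(name):
    """Re-pad a shard name to max(5, len(str(total_shards))) digits."""
    match = SHARD_RE.match(name)
    if match is None:
        return name  # not a shard file; leave it alone
    index, total = int(match.group(1)), int(match.group(2))
    width = max(5, len(str(total)))
    return f"model-{index:0{width}d}-of-{total:0{width}d}.safetensors"


def standardize_index(index):
    """Rewrite the weight_map of a model.safetensors.index.json payload."""
    weight_map = {
        param: standard_shard_name(shard)
        for param, shard in index.get("weight_map", {}).items()
    }
    return {**index, "weight_map": weight_map}


def remap_tokenizer_class(config, remap=TOKENIZER_CLASS_REMAP):
    """Rewrite tokenizer_class via the lookup table; unmapped classes pass through."""
    current = config.get("tokenizer_class")
    if current in remap:
        return {**config, "tokenizer_class": remap[current]}
    return config


renamed = standard_shard_name("model-00001-of-000002.safetensors")
index = standardize_index(
    {"weight_map": {"lm_head.weight": "model-00002-of-000002.safetensors"}}
)
tok = remap_tokenizer_class({"tokenizer_class": "TokenizersBackend"})
```

Both transformations are idempotent in this form: running them on already-standard input returns the same values, and passing an empty `remap` leaves the tokenizer config untouched.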
113
+
114
+ ### Test suite (Phase C)
115
+
116
+ **+38 new tests** in `tests/test_training_run.py`:
117
+
118
+ - `poll_run_progress` — missing dir, empty dir, latest file only, sorted iter dirs, non-iter siblings ignored, unparseable latest file degrades to 0.
119
+ - `standardize_hf_export` shard-rename branch — over-padded shards renamed (+ index rewrite), idempotent re-run, already-standard shards untouched, genuine 5-digit totals left alone, missing dir errors, malformed index errors, missing index tolerated.
120
+ - `standardize_hf_export` tokenizer-class branch — default remap fires, unmapped class left alone, empty remap dict disables fix, missing tokenizer config tolerated, exported constant value locked.
121
+ - `run` — NeMo full + smoke modes (with `--train-iters` vs `--smoke` flag verification), `extra_env` overrides forwarded into the docker-exec command, Unsloth requires `TRAIN_SCRIPT`, Unsloth with `TRAIN_SCRIPT` produces ps-train-targeted command + env vars, non-zero runner rc raises `TrainError`, bad mode + negative poll-interval rejected, `on_progress` callback fires, async runner + poll loop, recipe-preflight failure surfaces as `TrainError`.
122
+ - `merge_and_export` — NeMo end-to-end with shard rename + tokenizer remap baked in, explicit `iter` overrides resolution, `expect_iter` mismatch guard, missing-iter clear error, `standardize=False` skips cleanup, Unsloth requires `MERGE_SCRIPT`, Unsloth with `MERGE_SCRIPT` produces one-shot docker-exec, `MergeExportResult` frozen, `TrainResult` frozen.
123
+
124
+ All pure-python; no torch / docker / megatron-bridge / live container needed. Backend shell-outs are exercised via a `_FakeRunner` / `_MergeExportRunner` injection that records the command and writes synthetic `iter_NNNNNNN/` directories. Total suite: **601 passed, 2 skipped** offline (`pytest -q`, `/tmp/fk` venv). The 2 skips are the long-standing `--spark`-gated live-NIM / pgvector tests.
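The poll-disk liveness signal these tests exercise can be approximated as follows. The sketch assumes seven-digit `iter_NNNNNNN` directory names and returns directory names rather than paths; `poll_progress_sketch` is a hypothetical stand-in, not the library's `poll_run_progress`.

```python
import re
import tempfile
from pathlib import Path

ITER_DIR_RE = re.compile(r"^iter_\d{7}$")  # assumption: seven-digit iteration suffix


def poll_progress_sketch(run_dir):
    """Return (latest_iter, sorted iter dir names); (0, []) when nothing exists."""
    run_dir = Path(run_dir)
    if not run_dir.is_dir():
        return 0, []
    marker = run_dir / "latest_checkpointed_iteration.txt"
    try:
        latest = int(marker.read_text().strip())
    except (OSError, ValueError):
        latest = 0  # missing or unparseable marker degrades to 0
    iter_dirs = sorted(
        p.name for p in run_dir.iterdir()
        if p.is_dir() and ITER_DIR_RE.match(p.name)
    )
    return latest, iter_dirs


# Build a synthetic run directory, much as a fake runner might.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "iter_0000200").mkdir()
    (root / "iter_0000100").mkdir()
    (root / "logs").mkdir()  # non-iter sibling, ignored
    (root / "latest_checkpointed_iteration.txt").write_text("200\n")
    progress = poll_progress_sketch(root)
    missing = poll_progress_sketch(root / "does-not-exist")
```

Reading only the marker file and directory names keeps the check cheap enough to run on every poll interval, and it does not depend on log output being flushed.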
125
+
126
+ ### Added — `fieldkit.training` v0.5 build-out (Phase B: `convert`)
127
+
128
+ Second module of the v0.5 `fieldkit.training` build-out, after Phase A's `recipe.TrainRecipe` (commit `bee458d`). Absorbs the two patches that the patent-strategist v3 NeMo lane (Phase 6.5, article `patent-strategist-bakeoff-unsloth-vs-nemo-framework`) discovered the hard way during session 2026-05-21, so the next lane doesn't repeat the discovery.
129
+
130
+ - **`fieldkit.training.HFToMegatron`** — frozen dataclass wrapping `megatron.bridge.AutoBridge` with the YARN-rope-defaults fix baked in. Mirrors `scripts/p65_convert_hf_to_mcore.py`; replaces the hand-written script for any future YARN-rope HF model (DeepSeek-R1-Qwen3, Qwen3 extended-ctx, ...) headed for NeMo training. Lazy-imports `torch` + `megatron.bridge` — module import has no GPU cost and pure-inference dev envs stay clean. Run inside `nvcr.io/nvidia/nemo:26.04.00`; outside that envelope `.run()` raises `ConvertError` with a clear pointer.
131
+ - **`patch_yarn_defaults(provider)`** — the load-bearing helper, also exported. Sets `yarn_beta_fast=32.0` / `yarn_beta_slow=1.0` / `yarn_mscale=1.0` / `yarn_mscale_all_dim=0.0` / `yarn_correction_range_round_to_int=True` (from `megatron.core.models.common.embeddings.yarn_rotary_pos_embedding`) on a provider whose YARN fields the bridge left as `None`. Idempotent — re-running after a successful patch is a no-op. Pure-python, offline-testable with a duck-typed `SimpleNamespace`. The `YARN_DEFAULTS` constant is also exposed for inspection.
132
+ - **`register_llama_cpp_pretokenizer_hash(...)`** — idempotent string-patcher for llama.cpp's `convert_hf_to_gguf.py`. Inserts a 3-4 line block into the `get_vocab_base_pre` `if chkhsh == "...":` chain so future tokenizers (e.g. DeepSeek-R1-0528-Qwen3-8B) work without waiting for upstream merges. Returns `True` on insertion, `False` if the hash is already present, raises `ConvertError` on malformed inputs or a mis-pointed script. Re-apply after a fresh `git pull` on the llama.cpp checkout. The DeepSeek-R1 case lives in the module as `DEEPSEEK_R1_0528_QWEN3_TOKENIZER_HASH` so the next caller doesn't have to re-find it.
133
+ - **`ConvertError`** — distinct exception class so callers can selectively catch convert-stage failures vs other runtime exceptions.
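The fill-only-if-`None` behaviour described for `patch_yarn_defaults` can be tried offline with a duck-typed provider. The five default values are the ones listed above; the function name `fill_yarn_defaults`, its return value, and the rule for deciding which attributes to touch are assumptions for this sketch, since the changelog does not spell out how a non-YARN provider is detected.

```python
from types import SimpleNamespace

# Defaults as listed in the changelog for YARN_DEFAULTS.
YARN_DEFAULTS = {
    "yarn_beta_fast": 32.0,
    "yarn_beta_slow": 1.0,
    "yarn_mscale": 1.0,
    "yarn_mscale_all_dim": 0.0,
    "yarn_correction_range_round_to_int": True,
}


def fill_yarn_defaults(provider):
    """Set each YARN field that exists on the provider and is still None.

    Returns the names that were filled, so a second call returns [].
    """
    patched = []
    for name, default in YARN_DEFAULTS.items():
        if hasattr(provider, name) and getattr(provider, name) is None:
            setattr(provider, name, default)
            patched.append(name)
    return patched


# Duck-typed stand-in for a bridge provider; one field is already set.
provider = SimpleNamespace(
    yarn_beta_fast=None,
    yarn_beta_slow=None,
    yarn_mscale=0.5,
    yarn_mscale_all_dim=None,
    yarn_correction_range_round_to_int=None,
)
first = fill_yarn_defaults(provider)
second = fill_yarn_defaults(provider)
```

Already-set values such as `yarn_mscale=0.5` survive untouched, which is the property that makes re-running the patch safe.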
134
+
135
+ ### Test suite (Phase B)
136
+
137
+ **+16 new tests** in `tests/test_training_convert.py`:
138
+
139
+ - `patch_yarn_defaults` — happy path (all five fields patched), non-YARN provider skipped, already-set values preserved, idempotency, missing-attrs handled.
140
+ - `register_llama_cpp_pretokenizer_hash` — insertion vs idempotent no-op, preserves existing chain blocks byte-identical, rejects non-hex / short hashes, raises on missing file / missing chain pattern.
141
+ - `HFToMegatron` — dataclass shape (frozen, default `torch_dtype='bfloat16'`), `.run()` raises clear `ConvertError` when megatron-bridge is missing (`/tmp/fk` venv path — guarded with `try: import megatron.bridge` so the test is a no-op on a real `nemo-train` env).
142
+ - `DEEPSEEK_R1_0528_QWEN3_TOKENIZER_HASH` constant format check.
143
+
144
+ All pure-python; no torch / megatron-bridge / live llama.cpp checkout needed. Total suite: **563 passed, 2 skipped** offline (`pytest -q`, `/tmp/fk` venv). The 2 skips are the long-standing `--spark`-gated live-NIM / pgvector tests.
145
+
146
+ ### Added — `fieldkit.training` v0.5 build-out (Phase A: `recipe.TrainRecipe`)
147
+
148
+ First module of the v0.5 build-out — the declarative scaffold every later phase consumes. Captures what `scripts/p65_train_nemo_lora.{py,sh}` previously spread across argparse + bash env vars in a single typed dataclass, so one recipe drives either lane (NeMo or Unsloth) and offline preflight catches bad inputs before any container start.
149
+
150
+ - **`fieldkit.training.TrainRecipe`** — frozen dataclass capturing backend / base_model / dataset / lora_rank / lora_alpha / lora_target_modules / lora_dropout / lr / warmup_steps / total_train_iters / micro_batch / global_batch / seq_length / save_interval / output_dir / log_interval / extra_env / mode. `validate()` is offline (pure-python, no filesystem touch); `preflight()` adds filesystem-existence checks on output_dir's parent + dataset path. YAML round-trip via `to_yaml` / `from_yaml` works with or without pyyaml (hand-rolled flat-schema fallback so the v0.5 surface installs cleanly in pure-pip envs).
151
+ - **`fieldkit.training.lora_target_modules_for_backend(modules, backend)`** — maps HF target-module names (`q_proj` / `k_proj` / `v_proj` / `o_proj` / `gate_proj` / `up_proj` / `down_proj`) to Megatron-Bridge fused names (`linear_qkv` / `linear_proj` / `linear_fc1` / `linear_fc2`) at runtime so one recipe field drives either lane. Idempotent on already-mapped names.
152
+ - **`fieldkit.training.MODE_FULL` / `MODE_SMOKE`** — string constants for the recipe's `mode` field; used by `run()` (Phase C) to decide between `runs-smoke/` and `runs-full/` output layout.
153
+ - **`fieldkit.training.RecipeError`** — distinct exception class for recipe-stage validation failures.
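The target-module mapping can be pictured as a small lookup table. The pairing below (q/k/v into `linear_qkv`, `o_proj` into `linear_proj`, gate/up into `linear_fc1`, `down_proj` into `linear_fc2`) is the conventional Megatron fusion and is an assumption here, as are the `"nemo"` backend string and the helper name `map_targets`; the changelog lists both name sets without stating the exact pairing.

```python
# Assumed HF -> Megatron-Bridge fused-name pairing (not confirmed by the changelog).
HF_TO_MEGATRON = {
    "q_proj": "linear_qkv",
    "k_proj": "linear_qkv",
    "v_proj": "linear_qkv",
    "o_proj": "linear_proj",
    "gate_proj": "linear_fc1",
    "up_proj": "linear_fc1",
    "down_proj": "linear_fc2",
}


def map_targets(modules, backend):
    """Map HF module names to fused names for the NeMo lane; pass through otherwise."""
    if backend != "nemo":
        return list(modules)
    fused = []
    for module in modules:
        name = HF_TO_MEGATRON.get(module, module)  # already-fused names pass through
        if name not in fused:  # several HF names collapse into one fused name
            fused.append(name)
    return fused


hf_modules = ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
nemo_targets = map_targets(hf_modules, "nemo")
unsloth_targets = map_targets(hf_modules, "unsloth")
```

Because fused names map to themselves, feeding `nemo_targets` back through `map_targets` returns the same list, which is the idempotence the bullet above describes.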
154
+
155
+ The pre-existing v0.4.x RL primitives (`WeightDeltaTracker`, `LoraReferenceSnapshot`) continue to re-export from `fieldkit.training` unchanged.
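
As a rough illustration of the HF-to-Megatron name mapping described for `lora_target_modules_for_backend`, the sketch below folds HF projection names into fused names. The pairing table, the `"nemo"` backend string, and the name `map_target_modules` are assumptions for illustration; the changelog lists both name sets but does not spell out the pairing or the library's internals.

```python
# Assumed pairing of HF projection names to Megatron-Bridge fused names.
_HF_TO_FUSED = {
    "q_proj": "linear_qkv",
    "k_proj": "linear_qkv",
    "v_proj": "linear_qkv",
    "o_proj": "linear_proj",
    "gate_proj": "linear_fc1",
    "up_proj": "linear_fc1",
    "down_proj": "linear_fc2",
}


def map_target_modules(modules, backend):
    """Hypothetical stand-in for lora_target_modules_for_backend."""
    if backend != "nemo":  # assumed backend label; other lanes keep HF names
        return list(modules)
    mapped = []
    for name in modules:
        # Names that are already fused are not in the table and pass through.
        fused = _HF_TO_FUSED.get(name, name)
        if fused not in mapped:  # q/k/v all collapse onto one fused module
            mapped.append(fused)
    return mapped
```

Because unknown names pass through unchanged and duplicates are dropped, applying the function to its own output returns the same list, which is one way to get the idempotence the changelog mentions.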
156
+
157
+ ### Test suite (Phase A)
158
+
159
+ **+39 new tests** in `tests/test_training_recipe.py` — validate / preflight / YAML round-trip / target-module mapping / frozen-enforcement / mode constants. All pure-python; no torch / megatron-bridge / container needed. Total suite at Phase A landing: **547 passed, 2 skipped** offline (`pytest -q`, `/tmp/fk` venv) — up from 507 at v0.4.3.
160
+
161
+ ### Test suite (cross-phase total)
162
+
163
+ | Phase | New tests | Cumulative | Module |
164
+ |---|---|---|---|
165
+ | A | +39 | 547 | `recipe.TrainRecipe` + helpers |
166
+ | B | +16 | 563 | `convert.HFToMegatron` + pretokenizer registrar |
167
+ | C | +38 | 601 | `run` + `merge_and_export` + `standardize_hf_export` |
168
+ | D | +56 | 657 | `probe.ReasoningProbe` + `ProbeReport.compare` |
169
+ | E | +53 | 710 | `decide.train_backend` + `refresh` |
170
+ | **Total** | **+202** | **710 passed, 2 skipped** | `fieldkit.training` surface 7 → 46 `__all__` |
171
+
172
+ All pure-python (torch / transformers / megatron-bridge / docker lazy-imported, fake-runner injection for shell-out paths). The 2 skips are the long-standing `--spark`-gated live-NIM / pgvector tests, unchanged since v0.4.x. All 46 `__all__` symbols in `fieldkit.training` documented under `audit_docs.py --strict-kwargs`.
173
+
174
+ ### Verified on Spark
175
+
176
+ The v0.5 build-out was driven from the patent-strategist v3 paired bakeoff (Phase 6.5 of `specs/patent-strategist-v1.md`) — every module exercised against live infra during sessions 2026-05-21 → 2026-05-22:
177
+
178
+ - **`convert.HFToMegatron` + `patch_yarn_defaults`** — converted `deepseek-ai/DeepSeek-R1-0528-Qwen3-8B` HF → Megatron-core inside `nvcr.io/nvidia/nemo:26.04.00`. YARN-rope-defaults landmine fixed in-library.
179
+ - **`convert.register_llama_cpp_pretokenizer_hash`** — registered the DeepSeek-R1-0528-Qwen3-8B BPE-pretokenizer hash (`0d75215...` → `qwen35`) into `/home/nvidia/llama.cpp/convert_hf_to_gguf.py`. Subsequent GGUF conversions stable across all 4 quants × 2 lanes.
180
+ - **`run` + `merge_and_export` + `standardize_hf_export`** — drove both the NeMo Framework lane (LoRA-SFT inside `nemo-train`, 8h 04m full-train wall) and the Unsloth lane (4-bit QLoRA, 10h 52m wall) end-to-end. Both lanes' LoRA adapters merged to BF16 HF; shard-rename + tokenizer-class fixes baked into the NeMo export (no more post-merge bash patching).
181
+ - **`probe.ReasoningProbe.compare(normalize_budget=True)`** — produced Article H's apples-to-apples chain-length headline (NeMo +44% patent-strategic mean chain). The budget-normalization knob excluded qids whose chain exceeded the smaller cap in either lane (Unsloth 1536 vs NeMo 2048) so the metric isn't inflated by the higher-budget side.
182
+ - **`decide.train_backend`** — first seed entry `2026-05-22-paired-bakeoff.yaml` shipped in `fieldkit/src/fieldkit/training/data/decide-entries/` (wheel-resident via the new package-data glob). End-to-end smoke green: `train_backend(base_model_family="qwen3-r1-distill", optimize_for="patent_chain_length")` → `nemo`.
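
The budget normalization described for `probe.ReasoningProbe.compare(normalize_budget=True)` can be pictured with a small sketch. It assumes per-lane dictionaries of `qid -> chain length in tokens` and drops any qid whose chain exceeds the smaller cap in either lane before averaging. The function name, the input shape, and the strict `>` comparison are assumptions, not the library's implementation.

```python
from statistics import mean


def budget_normalized_means(chains_a, chains_b, cap_a, cap_b):
    """Return (mean_a, mean_b, kept_qids) over qids within the smaller cap.

    chains_a / chains_b: dict mapping qid -> chain length (tokens) per lane.
    """
    cap = min(cap_a, cap_b)
    shared = chains_a.keys() & chains_b.keys()
    # Keep a qid only if neither lane's chain exceeded the smaller budget.
    kept = sorted(q for q in shared if chains_a[q] <= cap and chains_b[q] <= cap)
    if not kept:
        return None, None, kept
    return (
        mean(chains_a[q] for q in kept),
        mean(chains_b[q] for q in kept),
        kept,
    )
```

With caps of 1536 and 2048, a question answered in 2000 tokens by the higher-budget lane is excluded from both means, so the comparison only covers chains that either lane could have produced.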
183
+
184
+ ### Artifacts
185
+
186
+ Phase 6.5 paired-bakeoff outputs (live on HuggingFace under `Orionfold/`):
187
+
188
+ - [Orionfold/patent-strategist-v3-unsloth-GGUF](https://huggingface.co/Orionfold/patent-strategist-v3-unsloth-GGUF) — Q4_K_M / Q5_K_M / Q6_K / Q8_0
189
+ - [Orionfold/patent-strategist-v3-nemo-GGUF](https://huggingface.co/Orionfold/patent-strategist-v3-nemo-GGUF) — Q4_K_M / Q5_K_M / Q6_K / Q8_0
190
+ - [Orionfold/patent-strategist-v3-unsloth](https://huggingface.co/Orionfold/patent-strategist-v3-unsloth) — BF16
191
+ - [Orionfold/patent-strategist-v3-nemo](https://huggingface.co/Orionfold/patent-strategist-v3-nemo) — BF16
192
+
193
+ ### Articles in this release
194
+
195
+ - `articles/patent-strategist-bakeoff-unsloth-vs-nemo-framework/` — Article H, the marquee. Drives every Phase A–E module: `TrainRecipe` owns the lane recipes; `convert` carries the YARN + pretokenizer fixes; `run` + `merge_and_export` ran both lanes; `probe.compare(normalize_budget=True)` produced the +44% chain-length headline; `decide.train_backend`'s seed entry is the first row in the decide corpus.
196
+ - `articles/unsloth-on-the-spark-when-train-peak-equals-base-peak/` — Unsloth feasibility companion to the bakeoff (`fieldkit.training.run` drove the Unsloth lane).
197
+ - `articles/fine-tune-data-prep-decisions-on-spark/` — patent-strategist v2 corpus diagnosis (data-layer; doesn't depend on the v0.5 surface but ships in the same window).
198
+ - `articles/becoming-a-medical-curator-on-spark/` — vertical 4 medical card (uses `fieldkit.publish`; pre-v0.5 surface).
199
+
200
+ ## [0.4.3] — 2026-05-17
201
+
202
+ ### Added — `fieldkit.eval` patent-strategist scorer build-out (T6)
203
+
204
+ Four new scorers in `fieldkit.eval` round out the `format='patent-strategist'` branch landed in v0.4.2 (T4) and the `mcq_letter` promotion (T5), per `specs/patent-strategist-v1.md` §3.3:
205
+
206
+ - **`patent_claim_validity(predicted, expected, *, judge, rubric=None)`** — PatentScore-methodology 7-dim claim-validity scorer (novelty / non-obviousness / written-description / enablement / indefiniteness / subject-matter-eligibility / dependent-claim-structure). LLM-judge backed; caller supplies a `Judge(client=..., rubric=RUBRIC_PATENT_CLAIM_VALIDITY)`. Per-row `rubric` dict (e.g. `cited_prior_art`, `claim_type`) is rendered into a sorted, deterministic `Hints:` block fed to the judge as context. PatentScore methodology only — no data reuse from the cited paper (license unclear).
207
+ - **`office_action_argument(predicted, expected, *, judge, rubric=None)`** — 4-dim office-action-response scorer (rejection-type identification, statutory citation accuracy, argument structure, persuasiveness). Same `Judge`-wrapping shape; per-row hints like `rejection_type`, `required_citations`, `claim_count`, `relies_on_official_notice` flow through the `Hints:` block.
208
+ - **`irac_structure(predicted, expected="")`** — deterministic 4-checklist scorer for Patent-Bar-style IRAC responses. One regex per component (Issue / Rule / Application / Conclusion); returns `{0.0, 0.25, 0.5, 0.75, 1.0}` based on how many fire. Tolerant patterns — markdown headings, all-caps section labels, transition prose ("Whether…", "Under 35 USC 103…", "Here…", "Therefore…") all count. False positives are far less harmful than false negatives at quarter-granularity. The only T6 scorer that needs no network, so it's the one wired end-to-end through `VerticalBench` in the integration test.
209
+ - **`prior_art_relevance(predicted, expected) -> float`** — Spearman ρ on ranked prior-art lists, returning just the rho per spec §3.3. Tolerant parser accepts JSON arrays (`'["a","b","c"]'`), comma-separated, or newline-separated (with `1.`, `1)`, `- `, `* ` prefixes stripped) as well as `list[str]` directly. Missing-from-pred gold items get worst-rank padding so omissions still penalize. The paired-rank vectors are re-rankified before correlation so positional gaps from dup-skipping or padding collapse to contiguous ranks — without this, `["a","a","b","c"]` vs `["a","b","c"]` would yield ρ≈0.98 instead of the intuitive 1.0. **`prior_art_relevance_full`** returns the same rho plus an `mse_likert` field (populated only when both sides parse as numeric Likert vectors) and `n`, packaged as the frozen `PriorArtRelevanceResult` dataclass.
210
+
211
+ ### Added — rubric markdown bundled in the wheel
212
+
213
+ - **`fieldkit/src/fieldkit/eval/rubrics/{patent_claim_validity,office_action_argument}.md`** — system-prompt markdown shipped alongside the module. Loaded lazily via the new **`load_rubric(name)`** helper (and exposed via the **`RUBRIC_PATENT_CLAIM_VALIDITY`** / **`RUBRIC_OFFICE_ACTION_ARGUMENT`** module constants for the common case). `[tool.hatch.build.targets.wheel].include` extended with `src/fieldkit/eval/rubrics/*.md` so the markdown lands in the wheel.
214
+
215
+ ### Added — `fieldkit.eval.vertical` live-callable dispatch
216
+
217
+ - **`PATENT_STRATEGIST_SCORER_FNS: dict[str, Callable[..., float]]`**: companion to the existing string-keyed `PATENT_STRATEGIST_SCORERS` map. Resolves the four T6 scorers + the promoted `mcq_letter` to live functions. It skips the two `judge_rubric` slots (`"C"`, `"E"`), which are open-ended `Judge.grade(...)` calls without a single named scorer fn. A drift-detection test asserts that every fn's `__name__` matches the corresponding string-map entry.
218
+
219
+ ### Test suite
220
+
221
+ **+93 new tests** across three new test files + the existing vertical-bench test class:
222
+
223
+ - `tests/eval/test_irac_structure.py` — perfect / partial / per-component-detector coverage; quarter-granularity parametrize; whitespace-only / empty / expected-arg-ignored edges.
224
+ - `tests/eval/test_prior_art_relevance.py` — perfect / reversed / partial-overlap; string-parsing variants (JSON, comma, newline-numbered, bullet, paren-numbered); Likert MSE branch (perfect, off-by-one, length-mismatch fallback, non-numeric); dataclass shape (frozen, three fields); the known-value `n=4` swap (ρ=0.8) plus the dup-skip test that drove the `_rankify`-on-paired-vectors fix.
225
+ - `tests/eval/test_judge_backed_scorers.py` — `load_rubric` round-trip + missing-file error; `_format_rubric_hints` (empty / scalar / list-bullet / sorted-determinism / nested-dict JSON); both judge-backed scorers wired against a `_FakeJudge` fixture (no network) covering happy path, `None`-score fallback to `0.0`, rubric→`Hints:` threading, empty-reference collapse to `None`; signature-introspection tests ensuring `judge` and `rubric` stay keyword-only so `VerticalBench.scorer_kwargs` plumbing works.
226
+ - `tests/test_vertical_bench.py::TestPatentStrategistFormat` — 3 new tests: `PATENT_STRATEGIST_SCORER_FNS` resolves each key to the expected callable; name-map vs fn-map drift assertion; full end-to-end `VerticalBench.run` exercising `irac_structure` over a 2-row JSONL with one perfect and one half-formed IRAC response (mean accuracy = 0.75).
227
+
228
+ Total suite: **507 passed, 2 skipped** offline (`pytest -q`, `/tmp/fk` venv). The 2 skips are the long-standing `--spark`-gated live-NIM / pgvector integration tests.
229
+
230
+ ### Articles in this release
231
+
232
+ - `articles/becoming-a-patent-strategist-on-spark/` — patent-strategist v1.0 article (W3 publish target per spec §1 deliverables). T6's scorer build-out is the load-bearing dependency for the article's bench-comparison numbers; v0.4.3 is the version the article will pin against.
233
+
9
234
  ## [0.4.2] — 2026-05-15
10
235
 
11
236
  Patch release. Two card-rendering polish lifts on `fieldkit.publish` driven by the 2026-05-15 cyber-vertical cycle (`Orionfold/SecurityLLM-GGUF`, the third vertical card on this surface — zero fieldkit source changes between Saul / cyber, the v0.4.1 publishing surface generalized exactly as designed). Both lifts are additive (one new `ModelCard` field already shipped on `main` in `ff1b92f`; one new `ArtifactManifest` field added here). No new modules, no new public classes, no breaking changes — purely a tightening pass.
@@ -1,6 +1,6 @@
1
1
  Metadata-Version: 2.4
2
2
  Name: fieldkit
3
- Version: 0.4.2
3
+ Version: 0.5.0
4
4
  Summary: Verified-on-Spark patterns lifted from the ai-field-notes blog into one importable Python package.
5
5
  Project-URL: Homepage, https://ainative.business/fieldkit/
6
6
  Project-URL: Source, https://github.com/manavsehgal/ai-field-notes/tree/main/fieldkit
@@ -54,6 +54,14 @@ from fieldkit.eval import (
54
54
  # v0.4.x — vertical-curator surface
55
55
  VerticalBench, VerticalQA,
56
56
  contains, exact_match, numeric_match,
57
+
58
+ # v0.4.3 — patent-strategist scorers
59
+ mcq_letter,
60
+ irac_structure,
61
+ prior_art_relevance, prior_art_relevance_full, PriorArtRelevanceResult,
62
+ patent_claim_validity, office_action_argument,
63
+ RUBRIC_PATENT_CLAIM_VALIDITY, RUBRIC_OFFICE_ACTION_ARGUMENT,
64
+ load_rubric,
57
65
  )
58
66
  ```
59
67
 
@@ -74,6 +82,10 @@ b.dump("benchmark.json") # full JSON
74
82
 
75
83
  Exceptions in the callable are caught and recorded with `success=False` so a single bad input doesn't sink the sweep. Pass `on_error="raise"` to abort on first failure.
76
84
 
85
+ `Bench.record(*, input=None, output=None, latency_ms, success=True, error=None, tags=None, **metrics)` is the imperative variant — use it when the wrapped function already returns its own latency breakdown (embed/retrieve/generate sub-timings) and you want to record those components without re-timing the wall clock. `output` is stashed for `include_outputs=True` dumps; `latency_ms` is the only required kwarg.
86
+
87
+ `Bench.to_dict(*, include_outputs=False)` and `Bench.dump(path, *, include_outputs=False)` both default to *eliding* the raw per-call outputs because benchmark JSON files balloon fast on long-context generations. Flip `include_outputs=True` when you need the model's actual response text for downstream auditing (e.g. feeding into `Judge` after the fact).
88
+
77
89
  ### `Judge(client: NIMClient, rubric=RUBRIC_CORRECTNESS, ...)`
78
90
 
79
91
  LLM-as-judge wrapping any `NIMClient`. Three built-in rubrics: `correctness`, `faithfulness`, `relevance`.
@@ -114,6 +126,8 @@ traj.cumulative_best() # list[float]
114
126
 
115
127
  Permissive parser drops malformed lines silently — the agent loop emits intermediate `proposed`/`failed` records too.
116
128
 
129
+ `Trajectory.repeat_rate(*, window=None)` returns a single float for the whole trajectory by default; pass `window=N` to get a per-window list of `{first, last, n, repeats, rate}` records — useful for showing the repeat rate climbing as the proposer's history horizon forgets older proposals. `Trajectory.mode_dominance(*, top_n=None)` returns *all* (knob, value) pairs by proposal count when `top_n=None`; pass `top_n=5` (or any int) to cap the list when the trajectory has long tails and you only care about the dominant modes.
130
+
117
131
  ### `is_refusal(text) -> bool`
118
132
 
119
133
  Catches "context does not contain the answer", "I do not know", "not specified", and other refusal patterns unioned from `rag-eval-ragas-and-nemo-evaluator` and `lora-on-your-own-qa-pairs`.
@@ -154,7 +168,7 @@ result = pak.score(
154
168
  print(result.pass_at) # {1: 0.7050, 8: 0.8415}
155
169
  ```
156
170
 
157
- `samples` is a sequence-of-sequences with one fixed sample count across problems; `PassAtK.score` raises if they diverge. `extras_fn(problem, samples) -> dict` is an optional hook for attaching per-problem metadata (first-sample tail, decode-token counts, etc.) onto each `per_task` row without bloating the grader interface.
171
+ `samples` is a sequence-of-sequences with one fixed sample count across problems; `PassAtK.score` raises if they diverge. `extras_fn(problem, samples) -> dict` is an optional hook for attaching per-problem metadata (first-sample tail, decode-token counts, etc.) onto each `per_task` row without bloating the grader interface. `task_id_field="task_id"` (default) names the key holding the canonical id; override when the bench uses `id`, `qid`, etc.
158
172
 
159
173
  When you've already graded the rollout offline (e.g. you have a `comparison.json` from a prior bench), use `pak.from_rows(rows)` with pre-counted `(task_id, n, passed)` triples to skip re-grading.
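
One plausible reading of the pre-counted `(task_id, n, passed)` triples is that they feed the standard unbiased pass@k estimator, `1 - C(n-c, k) / C(n, k)`, averaged over tasks. Whether `PassAtK` uses exactly this estimator is an assumption, and `pass_at_k` / `aggregate_pass_at` are hypothetical names used only for this sketch.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k for one task: n samples drawn, c of them passed."""
    if k > n:
        raise ValueError("k cannot exceed the number of samples n")
    if n - c < k:
        # Fewer than k failures exist, so any k-subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def aggregate_pass_at(rows, ks=(1, 8)):
    """Average pass@k over (task_id, n, passed) triples for each k."""
    return {
        k: sum(pass_at_k(n, passed, k) for _, n, passed in rows) / len(rows)
        for k in ks
    }
```

Working from counts rather than raw samples is what lets a prior `comparison.json` be rescored without re-running the grader.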
160
174
 
@@ -184,6 +198,8 @@ custom = AgentRun.from_record(
184
198
 
185
199
  `TurnDetail` keeps five canonical fields (`turn`, `action`, `duration_s`, `input_tokens`, `output_tokens`) and stuffs everything else from the source record into `extras` so the canonical accessors stay stable while bench-specific fields (`papers_retrieved`, `parse_errors`, `candidate_cfg`) survive round-tripping.
186
200
 
201
+ `AgentRun.from_record(raw, *, question_id_field, question_id_path, inference_path, status_field="status", wall_field="total_time", turns_field="turn_details", candidates_field="final_candidates")` exposes every field-name knob the AutoResearchBench parser hardcodes — override `status_field` / `wall_field` / `candidates_field` for benches that emit (say) `"final_status"` + `"wall_seconds"` + `"results"` instead. `AgentRun.to_dict(*, include_raw=False)` defaults to a compact summary; flip `include_raw=True` to preserve the full source record for provenance dumps (large — only do this when the dump is the source-of-truth artifact).
202
+
187
203
  Convenience accessors on `AgentRun` are pure derivations of `turns`: `tool_calls()` (action == "tool"), `tool_format_errors()` (action == "error"), `total_input_tokens()`, `total_output_tokens()`, `succeeded()` (status == "finished" AND ≥1 candidate). Override `succeeded()` for benches with different success semantics.
188
204
 
189
205
  `summarize_agent_runs(runs, label="...")` aggregates per-status counts plus `summarize_metric` rollups for `wall_seconds`, `turns`, `candidates`, `tool_calls`, `tool_format_errors`. Mirrors the JSON shape `articles/autoresearchbench-on-spark/scripts/analyze_run.py` writes — pass straight to `json.dumps`.
@@ -292,6 +308,52 @@ numeric_match("Revenue was $4.55B", "4.5B",
292
308
  | `contains(p, e)` | The model is asked to answer in prose and the reference is a key fact/number/phrase that must appear somewhere in the answer. |
293
309
  | `numeric_match(p, e, *, rel_tolerance=0.01)` | FinanceBench-style quantitative answers. Extracts the first number from each side (commas stripped), compares under relative tolerance. Defaults to ±1% per FinanceBench's grading convention. Returns 0.0 if either side has no parseable number — including refusals, so the refusal counter elsewhere doesn't need to gate this scorer. |
294
310
 
311
+ ### Patent-strategist scorers *(v0.4.3)*
312
+
313
+ Five scorers + two rubric constants land in v0.4.3 to round out the `format='patent-strategist'` branch of `VerticalBench`. Wire them through `VerticalBench(scorer=…, scorer_kwargs=…)` or import the live-callable dispatch map at `fieldkit.eval.vertical.PATENT_STRATEGIST_SCORER_FNS`. The 1-paragraph-per-scorer cheat sheet:
314
+
315
+ #### `mcq_letter(predicted, expected, *, strip_think=True) -> float`
316
+
317
+ MCQ letter scorer promoted from `scripts/g3_*.py` after three vertical-bench reuses (cybermetric, medmcqa, patent-strategist). Decision order: stripped one-letter (`"B"`), then `"answer: X"` / `"answer is X"` / `"option X"` / `"choice X"`, then first word-bounded `[A-D]`. Case-insensitive throughout. When `strip_think=True` (default), `<think>...</think>` blocks are regex-stripped *before* the three-step decision, which keeps reasoning-trace verbosity on R1-distill family models from polluting the letter pick. The strip is a no-op on cyber/medical text without `<think>` tags, so existing callers can leave the default on safely.
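
The three-step decision order can be sketched in a few lines. This is a minimal approximation written from the description above, not the shipped `mcq_letter`; the exact regexes and the helper names `pick_letter` / `mcq_letter_sketch` are assumptions.

```python
import re

_THINK = re.compile(r"<think>.*?</think>", re.DOTALL | re.IGNORECASE)
# Step 2 cues: "answer: X", "answer is X", "option X", "choice X".
_CUE = re.compile(
    r"\b(?:answer\s*(?:is|:)|option|choice)\s*\(?([A-D])\b", re.IGNORECASE
)
# Step 3: first stand-alone letter in A-D.
_BARE = re.compile(r"\b([A-D])\b", re.IGNORECASE)


def pick_letter(text, strip_think=True):
    """Return the chosen letter (upper-case) or None if nothing matches."""
    if strip_think:
        text = _THINK.sub("", text)
    stripped = text.strip()
    if re.fullmatch(r"[A-Da-d]", stripped):  # step 1: bare one-letter reply
        return stripped.upper()
    match = _CUE.search(stripped) or _BARE.search(stripped)
    return match.group(1).upper() if match else None


def mcq_letter_sketch(predicted, expected, *, strip_think=True):
    """Score 1.0 when the picked letter equals the expected letter."""
    letter = pick_letter(predicted, strip_think)
    return 1.0 if letter is not None and letter == expected.strip().upper() else 0.0
```

The ordering matters: a reasoning trace that mentions "option B" in passing would win step 2 if the `<think>` block were left in place, which is the failure the default strip avoids.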
318
+
319
+ #### `irac_structure(predicted, expected="") -> float`
320
+
321
+ Deterministic 4-checklist Patent-Bar IRAC detector. Returns one of `{0.0, 0.25, 0.5, 0.75, 1.0}` based on Issue / Rule / Application / Conclusion regex hits. Tolerant patterns: markdown headings, all-caps section labels, transition prose (`"Whether…"`, `"Under 35 USC 103…"`, `"Here…"`, `"Therefore…"`) all count. `expected` is ignored — the scorer measures structural form, not factual agreement; kept in the signature for `VerticalBench` compatibility. False positives are far less harmful than false negatives at this granularity; the score's job is to flag *structural absence*, not grade rhetorical polish.
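
A minimal sketch of the one-regex-per-component idea follows. The patterns are illustrative guesses at what "tolerant" could mean (headings, all-caps labels, transition phrases); they are not the shipped regexes, and `irac_score` is a hypothetical name.

```python
import re

# One tolerant pattern per IRAC component: a heading/label at line start,
# or a typical transition phrase anywhere in the text.
_PATTERNS = {
    "issue": re.compile(r"(?im)^\s*(?:#+\s*)?issue\b|\bwhether\b"),
    "rule": re.compile(r"(?im)^\s*(?:#+\s*)?rule\b|\bunder\s+35\s+U\.?S\.?C\.?"),
    "application": re.compile(
        r"(?im)^\s*(?:#+\s*)?(?:application|analysis)\b|\bhere,"
    ),
    "conclusion": re.compile(r"(?im)^\s*(?:#+\s*)?conclusion\b|\btherefore\b"),
}


def irac_score(predicted, expected=""):
    """Fraction of the four components detected; `expected` is ignored."""
    text = predicted or ""
    hits = sum(1 for pattern in _PATTERNS.values() if pattern.search(text))
    return hits / len(_PATTERNS)
```

Because each component contributes exactly one quarter, the only possible outputs are 0.0, 0.25, 0.5, 0.75, and 1.0.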
322
+
323
+ #### `prior_art_relevance(predicted, expected) -> float`
324
+
325
+ Spearman ρ between predicted and gold prior-art rankings — the bench-facing scalar per `specs/patent-strategist-v1.md` §3.3. Accepts `list[str]` directly or a tolerant string parse (JSON arrays `'["a","b","c"]'`, comma-separated `"a, b, c"`, or newline-separated with `1.` / `1)` / `- ` / `* ` prefixes stripped). Items missing from `predicted` get worst-rank padding so omissions still penalize. The paired-rank vectors get re-rankified before correlation so positional gaps from dup-skipping or padding collapse to contiguous ranks — without this, `["a","a","b","c"]` vs `["a","b","c"]` would yield ρ≈0.98 instead of 1.0.
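
The effect of re-rankifying the paired vectors can be reproduced with a short stdlib-only sketch. It assumes first-occurrence positions for duplicates and worst-rank padding of `len(predicted) + 1`; the helper names and those two choices are assumptions rather than the library's code.

```python
from math import sqrt


def _positions(items):
    """1-based index of each item's first occurrence (duplicates leave gaps)."""
    pos = {}
    for index, item in enumerate(items, start=1):
        pos.setdefault(item, index)
    return pos


def _rankify(values):
    """Contiguous 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for t in range(i, j + 1):
            ranks[order[t]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def _pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy) if vx and vy else 0.0


def spearman_sketch(predicted, expected, rerank=True):
    pred_pos, gold_pos = _positions(predicted), _positions(expected)
    worst = len(predicted) + 1  # padding rank for gold items missing from pred
    x = [pred_pos.get(item, worst) for item in gold_pos]
    y = [gold_pos[item] for item in gold_pos]
    if rerank:
        x, y = _rankify(x), _rankify(y)
    return _pearson(x, y)
```

With `rerank=False`, the duplicated `"a"` leaves the predicted positions at 1, 3, 4 against gold 1, 2, 3, and the correlation drops to about 0.98 even though the order is identical; re-ranking collapses the gap and restores 1.0.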
326
+
327
+ #### `prior_art_relevance_full(predicted, expected) -> PriorArtRelevanceResult`
328
+
329
+ Returns the same ρ plus an `mse_likert` field (populated only when both sides parse as numeric Likert vectors, e.g. `"5,4,3,2,1"`) and an `n` count, packaged as a frozen `PriorArtRelevanceResult(spearman_rho, mse_likert, n)` dataclass. The bench surface uses `prior_art_relevance` because the scorer contract is `Callable[..., float]`; this full variant is for callers that want both metrics in a single pass.
330
+
331
+ #### `patent_claim_validity(predicted, expected, *, judge, rubric=None) -> float`
332
+
333
+ PatentScore-methodology 7-dim claim-validity scorer (novelty / non-obviousness / written-description / enablement / indefiniteness / subject-matter-eligibility / dependent-claim-structure). LLM-judge backed; caller supplies a `Judge` instance constructed with `rubric=RUBRIC_PATENT_CLAIM_VALIDITY`. Per-row `rubric` dict (convention keys: `cited_prior_art`, `claim_type`, `dependency_target`, `statutory_focus`) renders into a deterministic sorted `Hints:` block fed to the judge as context. Returns the parsed score, mapping `None` → `0.0` so bench accuracy-averaging stays well-defined. **PatentScore methodology only — no data reuse from the cited paper** (license unclear).
334
+
335
+ ```python
336
+ from fieldkit.eval import Judge, RUBRIC_PATENT_CLAIM_VALIDITY, patent_claim_validity
337
+ from fieldkit.nim import NIMClient
338
+
339
+ with NIMClient(base_url="http://localhost:8000/v1", model="...") as c:
340
+ judge = Judge(client=c, rubric=RUBRIC_PATENT_CLAIM_VALIDITY)
341
+ score = patent_claim_validity(
342
+ predicted_claim_text,
343
+ reference_claim_text,
344
+ judge=judge,
345
+ rubric={"cited_prior_art": ["US10987654", "US20210123456"]},
346
+ )
347
+ ```
348
+
349
+ #### `office_action_argument(predicted, expected, *, judge, rubric=None) -> float`
350
+
351
+ 4-dim office-action-response scorer (rejection-type identification, statutory citation accuracy, argument structure, persuasiveness). Same `Judge`-wrapping shape as `patent_claim_validity`; pair with `RUBRIC_OFFICE_ACTION_ARGUMENT`. Convention rubric keys: `rejection_type` (`102` / `103` / `112(a)` / `112(b)` / `101` / `double-patenting` / `restriction`), `required_citations` (list of expected MPEP/CFR/case cites), `claim_count`, `relies_on_official_notice`.
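
Both judge-backed scorers render the per-row `rubric` dict into a sorted `Hints:` block. The sketch below shows one way such a renderer could stay deterministic. The exact layout (bullets, indentation, JSON for nested dicts) is an assumption, and `format_hints` is a hypothetical name, not the library's private helper.

```python
import json


def format_hints(rubric):
    """Render a per-row rubric dict as a deterministic 'Hints:' block."""
    if not rubric:
        return ""  # no hints: nothing is appended to the judge prompt
    lines = ["Hints:"]
    for key in sorted(rubric):  # sorted keys make the output order-independent
        value = rubric[key]
        if isinstance(value, (list, tuple)):
            lines.append(f"- {key}:")
            lines.extend(f"  - {item}" for item in value)
        elif isinstance(value, dict):
            lines.append(f"- {key}: {json.dumps(value, sort_keys=True)}")
        else:
            lines.append(f"- {key}: {value}")
    return "\n".join(lines)
```

Sorting the keys means two rows with the same hints always yield the same prompt text, which keeps judge calls reproducible and cacheable.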
352
+
353
+ #### Rubric loader: `load_rubric(name) -> str`
354
+
355
+ The two module constants, `RUBRIC_PATENT_CLAIM_VALIDITY` and `RUBRIC_OFFICE_ACTION_ARGUMENT`, are populated at import time from markdown files shipped under `fieldkit/eval/rubrics/`. Call `load_rubric("patent_claim_validity")` to re-read the file (or to load your own rubric, named `my_rubric.md`, if you ship a fork). The `[tool.hatch.build.targets.wheel].include` glob ships `*.md` under that subtree, so the rubrics travel with the wheel.
356
+
295
357
  ## Samples
296
358
 
297
359
  - [`samples/bench-rag.py`](https://github.com/manavsehgal/ai-field-notes/blob/main/fieldkit/samples/bench-rag.py) — offline `Bench` + `Judge.parse` walkthrough.
@@ -136,6 +136,8 @@ result.logged_calls # the upload_folder kwargs that would have fired
136
136
 
137
137
  Token resolution order: explicit `token=` arg → `HF_TOKEN` env → `HUGGING_FACE_HUB_TOKEN` env → `huggingface_hub`'s cached login. If all four are absent and `dry_run=False`, `HFAuthError` raises before the network call.
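
The four-step resolution order can be expressed as a short fallback chain. This is a sketch under stated assumptions: `resolve_hf_token`, the injectable `env` and `cached_login` parameters, and `TokenMissingError` (standing in for `HFAuthError`) are hypothetical names chosen so the example runs without `huggingface_hub` installed.

```python
import os


class TokenMissingError(RuntimeError):
    """Hypothetical stand-in for the library's HFAuthError."""


def resolve_hf_token(token=None, *, env=None, cached_login=None, dry_run=True):
    """Return the first available token, checking sources in priority order."""
    env = os.environ if env is None else env
    sources = (
        lambda: token,                              # 1. explicit argument
        lambda: env.get("HF_TOKEN"),                # 2. HF_TOKEN
        lambda: env.get("HUGGING_FACE_HUB_TOKEN"),  # 3. legacy env var
        cached_login or (lambda: None),             # 4. cached login lookup
    )
    for source in sources:  # later sources are only consulted when needed
        value = source()
        if value:
            return value
    if not dry_run:
        raise TokenMissingError("no Hugging Face token found in any source")
    return None
```

Failing before the network call, and only when `dry_run=False`, keeps dry-run staging usable on machines that have no credentials at all.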
138
138
 
139
+ `HFHubAdapter.push_folder(*, repo_name, commit_message="Initial Orionfold upload", private=False, repo_type="model")` exposes the three `huggingface_hub` kwargs the orchestrator passes through. `commit_message` defaults to the bootstrap value used by every first-push card; override it on subsequent updates (`"Polish llama_cpp_example_prompt"`, `"Add Q4_0 variant"`). `private=True` creates a private repo first (or no-ops if the repo already exists at any visibility, since `exist_ok=True` is baked in). `repo_type="model"` covers every Orionfold card; flip to `"dataset"` or `"space"` for the rare cases (lineage-store snapshots have shipped as datasets in past sessions).
140
+
139
141
  ### `publish_quant(*, quant_report, base_model, repo_name, staging_dir, ...) → PublishResult`
140
142
 
141
143
  The one-line orchestrator. Reads the duck-typed `quant_report` fields (`.format`, `.variants`, `.perplexity`, `.tokens_per_sec`, `.sustained_load_minutes`, `.variant_files`, `.vertical_eval`, `.vertical_eval_name`, `.model_license`, `.chat_format`, `.recommended_variant`, `.llama_cpp_example_prompt`), builds a `ModelCard`, stages the README + variant files, writes the `ArtifactManifest` (if `artifacts_dir` supplied), and invokes `HFHubAdapter.push_folder()`. Explicit kwargs override duck-typed report attrs.
@@ -166,6 +168,8 @@ result.hf_url # None in dry-run; set after live push
166
168
 
167
169
  The `model_license` / `chat_format` / `recommended_variant` kwargs landed in v0.4.x after the `Orionfold/finance-chat-GGUF` dry-run surfaced two card-rendering bugs: a hardcoded `license: apache-2.0` (wrong for the Llama-2 lineage AdaptLLM base) and an empty `## How to run` section (when no ollama handle or transformers snippet was supplied, the section header rendered with no body). Both are now caller-controlled with sane defaults.
168
170
 
171
+ `extra_tags=("finance", "evidence-based")` threads additional HF tags into the rendered card's frontmatter `tags:` array (deduplicated against the auto-generated tags like `gguf`, `quantized`, `orionfold`). Use for vertical-specific discoverability — the four shipped Orionfold cards each add their vertical name (`finance`, `legal`, `cyber`, `medical`) plus secondary tags driven by the base model's lineage.
172
+
169
173
  ## Why this surface
170
174
 
171
175
  Three things to notice. First, `HFHubAdapter` defaults to dry-run because the right workflow is dry-run → human review → live push. Library users who want a one-shot live push pass `dry_run=False` explicitly; library users who want the staging artifact for review (the common case during development) get it for free. The `hf-publisher` skill (`/home/nvidia/.claude/skills/hf-publisher/`) wraps this workflow as a triggered Claude Code surface.
@@ -63,27 +63,35 @@ print(report.variant_files["Q4_K_M"])
63
63
 
64
64
  If the source isn't already a GGUF, `quantize_gguf` first invokes `convert_hf_to_gguf.py --outtype f16` to produce a base F16 file, then runs `llama-quantize` per variant against that intermediate. The intermediate is reused as the F16 variant of the final report — no double-conversion. `dry_run=True` enumerates the subprocess commands into `report.notes` without running them; this is the path tests + CI use to verify the orchestration without needing an 8 GB checkpoint on hand.
65
65
 
66
- ### `measure_perplexity_gguf(gguf, *, corpus, paths, n_ctx=512)`
66
+ Pass `f16_path=` to skip the convert step entirely when you've already produced an F16 GGUF out-of-band (e.g. a prior partial run) — `quantize_gguf` then uses that file as the intermediate and only runs `llama-quantize` per remaining variant. Pass `extra_quantize_args=("--imatrix", "/path/to/imatrix.dat")` (or any other llama.cpp flag tuple) to thread additional arguments to every `llama-quantize` invocation; useful for K-quant importance-matrix calibration.
67
67
 
68
- Wraps `llama-perplexity`. Returns a `float` parsed from the canonical `Final estimate: PPL = N.NNN` line, or `None` on parse failure. Cards that ship without a perplexity column use the `None` path — the rendering is forgiving (the column shows `—`).
68
+ ### `measure_perplexity_gguf(*, gguf_path, corpus_path, paths=None, extra_args=(), dry_run=False)`
69
+
70
+ Wraps `llama-perplexity`. Returns a `float` parsed from the canonical `Final estimate: PPL = N.NNN` line, or `None` on parse failure / dry-run. Cards that ship without a perplexity column use the `None` path — the rendering is forgiving (the column shows `—`).
69
71
 
70
72
  ```python
71
73
  ppl = measure_perplexity_gguf(
72
- "/home/nvidia/data/quants/finance-chat/model-Q4_K_M.gguf",
73
- corpus="/home/nvidia/data/calibration/wikitext-2-raw-v1/wiki.test.raw",
74
+ gguf_path="/home/nvidia/data/quants/finance-chat/model-Q4_K_M.gguf",
75
+ corpus_path="/home/nvidia/data/calibration/wikitext-2-raw-v1/wiki.test.raw",
74
76
  paths=paths,
75
77
  ) # → 6.2215
76
78
  ```
77
79
 
78
- ### `measure_tokens_per_sec_gguf(gguf, *, paths, metric='tg', n_gpu_layers=99)`
80
+ Pass `extra_args=("-c", "1024", "--chunks", "20")` to thread additional flags to `llama-perplexity` (context window, chunk count, batch size, etc.). The default invocation uses llama.cpp's own defaults — bump `-c` when the calibration corpus's segments are longer than 512 tokens.
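
The `float`-or-`None` contract of `measure_perplexity_gguf` comes down to finding the `Final estimate: PPL = N.NNN` line in the tool's output. A minimal sketch of that parse, with `parse_final_ppl` as a hypothetical helper name and the regex as an assumption about how the line is matched:

```python
import re

# Matches the summary line that llama-perplexity prints at the end of a run.
_FINAL_PPL = re.compile(r"Final estimate:\s*PPL\s*=\s*([0-9]+(?:\.[0-9]+)?)")


def parse_final_ppl(output):
    """Return the final perplexity as a float, or None if the line is absent."""
    match = _FINAL_PPL.search(output)
    return float(match.group(1)) if match else None
```

Returning `None` instead of raising lets a card render its perplexity column as a placeholder when the run was a dry-run or the output format changed.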
81
+
82
+ ### `measure_tokens_per_sec_gguf(*, gguf_path, paths=None, n_gen=128, n_prompt=512, extra_args=(), dry_run=False)`
79
83
 
80
- Wraps `llama-bench`. `metric='tg'` returns text-generation `tok/s`; `metric='pp'` returns prompt-processing `tok/s`. Returns `None` on parse failure.
84
+ Wraps `llama-bench` and returns `{"tg": tok/s, "pp": tok/s}` — both axes matter on a real Spark card (`tg` dominates interactive decode latency, `pp` dominates long-context ingestion). Either value may individually be `None` if `llama-bench`'s output for that build doesn't carry the corresponding row.
81
85
 
82
86
  ```python
83
- tg = measure_tokens_per_sec_gguf(gguf, paths=paths, metric='tg') # → 31.1
84
- pp = measure_tokens_per_sec_gguf(gguf, paths=paths, metric='pp') # → 1111.1
87
+ out = measure_tokens_per_sec_gguf(
88
+ gguf_path=gguf, paths=paths, n_gen=128, n_prompt=512,
89
+ )
90
+ # {'tg': 31.1, 'pp': 1111.1}
85
91
  ```
86
92
 
93
+ `n_gen=128` and `n_prompt=512` are llama-bench's `-n` / `-p` flags — number of generated tokens (drives `tg`) and prompt tokens (drives `pp`). Defaults match the canonical Orionfold-card numbers; bump `n_prompt` when measuring long-context regimes (4K / 8K). Pass `extra_args=("-t", "16")` or any other llama-bench flag tuple to override thread counts, batch sizes, etc.
94
+
87
95
  ### `ThermalProbe(interval_s=2.0, throttle_temp_c=83.0)`
88
96
 
89
97
  Pure-stdlib `nvidia-smi` poll loop. Spin one in a background thread for the duration of a measurement run; on `stop()` it returns sustained-load minutes (the wall-clock time before the first sample crossed `throttle_temp_c` or hit a `clocks_throttle_reasons.hw_thermal_slowdown` flag). Per the 2026-05-12 HANDOFF Q9 decision, every Orionfold card publishes this number.
@@ -74,8 +74,8 @@ with NIMClient(base_url="http://localhost:8000/v1",
74
74
  | `ingest(docs, chunk_tokens=900)` | int | Chunks via `fieldkit.nim.chunk_text`, embeds in batches of 32, upserts in one transaction. Returns chunk count. |
75
75
  | `retrieve(query, top_k=5)` | `list[Chunk]` | pgvector cosine `<=>`. Each chunk carries `distance`. |
76
76
  | `rerank(query, chunks, top_k=3)` | `list[Chunk]` | Pass-through when `rerank_url=None` so the simplest pipeline works without NGC creds. |
77
- | `fuse(query, chunks, **gen_kwargs)` | dict | Builds the strict-context prompt and calls the generator. |
78
- | `ask(query, retrieve_k=5, rerank_k=3, ...)` | dict | Full chain. Returns `{"answer", "chunks", "raw"}`. |
77
+ | `fuse(query, chunks, *, max_tokens=256, temperature=0.0, **gen_kwargs)` | dict | Builds the strict-context prompt and calls the generator. `max_tokens` / `temperature` flow to `NIMClient.chat` (default `temperature=0.0` is grounded-RAG-correct; bump only when you want the fuser to paraphrase). |
78
+ | `ask(query, *, retrieve_k=5, rerank_k=3, max_tokens=256, temperature=0.0)` | dict | Full chain. Returns `{"answer", "chunks", "raw"}`. `temperature=0.0` keeps the strict-context answer deterministic; raise it if the generator should hedge across multiple equally-grounded chunks. |
79
79
 
80
80
  ### Chunk id encoding
81
81