PyPI - fieldkit - Versions diffs - 0.3.0__tar.gz → 0.4.1__tar.gz - Mend

fieldkit 0.3.0.tar.gz → 0.4.1.tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (48)

{fieldkit-0.3.0 → fieldkit-0.4.1}/CHANGELOG.md RENAMED

@@ -6,6 +6,105 @@ The format follows [Keep a Changelog](https://keepachangelog.com/en/1.1.0/), and
 ## [Unreleased]
+## [0.4.1] — 2026-05-14
+Patch release. The `fieldkit.eval.VerticalBench` overlay introduced in v0.4.0 needed two kwargs to score FinanceBench correctly (open-book context-prepend) and to bound a JSONL slice (subset filter on `question_type`). Both lifts came out of the 2026-05-13 V1 attempt on `AdaptLLM/finance-chat` (0/50 closed-book vs. 14–18%/50 open-book on the same JSONL) and the 2026-05-14 legal-curator scoring run on `Equall/Saul-7B-Instruct-v1`. The two scripts under `scripts/g3_*` that carried duplicated loaders now call into the package surface. No new modules, no new public classes — additive kwargs only.
+### Added — `fieldkit.eval.VerticalBench` open-book mode
+- **`VerticalBench.from_jsonl(..., open_book=...)`** — new kwarg. When `True`, FinanceBench rows have their `evidence[*].evidence_text` prepended to the question (templated as "Context from <doc>: …\n\nQuestion: …\n\nAnswer with just the numeric value.") so the model sees the 10-K excerpt the gold answer was derived from. Default `None` auto-resolves to `True` for `financebench` and `False` for `legalbench` / `generic` — the right defaults per benchmark convention. Lifts inline `_load_finbench_open_book` helpers from `scripts/g3_preflight_bench.py` and `scripts/g3_measure_variants.py` into the package surface; both scripts now call `VerticalBench.from_jsonl(open_book=True, subset=…)` instead of carrying duplicated loaders. The 2026-05-13 V1 attempt on AdaptLLM/finance-chat scored 0/50 closed-book and 14–18%/50 open-book on the same JSONL — open-book is the load-bearing flag for FinanceBench scoring.
+- **`VerticalBench.from_jsonl(..., subset=...)`** — new kwarg. FinanceBench-only convenience filter on the `question_type` column. Drops non-matching rows before the loader hits the `limit` cap, so callers can score the `metrics-generated` subset with `limit=50` and get 50 metrics-generated questions (not 50 mixed rows of which N are metrics-generated).
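The two kwargs above can be sketched in plain Python. This is a minimal illustration and not the package's code: `build_prompt`, `select_rows`, the `doc_name` key, the bare `Context:` prefix for string-shaped evidence, and the sample rows are all assumptions made for the example. Only the prompt template and the filter-before-limit ordering come from the changelog text.

```python
from typing import Any, Optional


def build_prompt(row: dict[str, Any], open_book: bool) -> str:
    """Return the question, with evidence prepended when open_book is set."""
    question = row["question"]
    evidence = row.get("evidence") or []
    if not open_book or not evidence:
        # Closed-book, or open-book with no evidence to prepend.
        return question
    parts = []
    for item in evidence:
        if isinstance(item, str):
            # List-of-strings evidence shape (prefix is an assumption).
            parts.append(f"Context: {item}")
        else:
            # List-of-dicts shape; "doc_name" is a hypothetical key.
            doc = item.get("doc_name", "document")
            parts.append(f"Context from {doc}: {item['evidence_text']}")
    context = "\n\n".join(parts)
    return f"{context}\n\nQuestion: {question}\n\nAnswer with just the numeric value."


def select_rows(rows: list[dict[str, Any]],
                subset: Optional[str] = None,
                limit: Optional[int] = None) -> list[dict[str, Any]]:
    """Filter on question_type first, then apply the limit cap."""
    if subset is not None:
        rows = [r for r in rows if r.get("question_type") == subset]
    return rows[:limit] if limit is not None else list(rows)


# Illustrative rows only; not taken from FinanceBench.
rows = [
    {"question": "Q1", "question_type": "metrics-generated",
     "evidence": [{"doc_name": "ACME_2023_10K", "evidence_text": "Revenue: $4.5B"}]},
    {"question": "Q2", "question_type": "novel-generated", "evidence": []},
    {"question": "Q3", "question_type": "metrics-generated",
     "evidence": ["Net income: $1.2B"]},
]
picked = select_rows(rows, subset="metrics-generated", limit=2)
prompts = [build_prompt(r, open_book=True) for r in picked]
```

Because the filter runs before the cap, `limit=2` here yields two `metrics-generated` rows rather than the first two rows of the file.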
+### Test suite
+**+8 new tests** on `TestOpenBook` in `tests/test_vertical_bench.py` covering: auto-default for financebench, explicit `False` keeps closed-book, missing-evidence falls back to closed-book, legalbench / generic are no-ops, list-of-strings evidence shape, subset filter, subset × limit composition. Total: **375 passed, 3 skipped** offline (`pytest -q`). The 3 skips are the two `--spark`-gated live-integration tests + the `torch`-import skip in `test_training.py` (CPU-only venv).
+### Articles in this release
+- [`becoming-a-legal-curator-on-spark`](https://ainative.business/field-notes/becoming-a-legal-curator-on-spark/) — second Orionfold quant card, swaps FinanceBench for a curated 5-task LegalBench subset. Drives the `subset` kwarg's first non-finance use (LegalBench tasks via `legalbench` format) and validates that the `open_book` default-off branch is correct for LegalBench JSONLs.
+### Verified on Spark
+- **Live HF push:** `Orionfold/Saul-7B-Instruct-v1-GGUF` (5 GGUF variants + README, ~37 GB) shipped 2026-05-14 via the same `publish_quant(dry_run=False)` path the finance-chat card used a week earlier. Zero source changes in `fieldkit.publish` between the two pushes — the v0.4.0 surface generalized as designed.
+## [0.4.0] — 2026-05-14
+Fourth public release. Two new top-level modules (`fieldkit.publish` + `fieldkit.quant`) for the G3 GGUF / Quantization Publisher pick (MTBM Pick #1 per `ideas/mtbm-use-cases.md` §6), the v0.4.x **vertical-curator overlay** on `fieldkit.eval` (`VerticalBench`), and post-dry-run card-rendering fixes that landed the first live HF push (`Orionfold/finance-chat-GGUF`). The two new modules together unlock most of Cluster G; this cut implements the GGUF critical path and stubs the other quant formats with named entry points pointing at the v0.5+ roadmap.
+### Added — `fieldkit.publish` (new module)
+HuggingFace Hub adapter + auto model card builder from `fieldkit.lineage`. Three public surfaces:
+- **`fieldkit.publish.ModelCard`** — frontmatter + body builder. Renders the canonical card every Orionfold artifact gets: YAML frontmatter (license, library_name, base_model, pipeline_tag, tags, model_creator), a title + elevator, a **Spark-tested** block (per-variant perplexity + tok/s + thermal envelope), a variants table, **How to run** (`ollama pull` + `from_pretrained` snippets), an optional **Lineage** block (rendered from a `fieldkit.lineage.LineageStore` if provided), a **Methods** backlink to `ainative.business/field-notes/<slug>/`, and a footer attributing the publication to Orionfold LLC.
+- **`fieldkit.publish.ArtifactManifest`** — frozen dataclass for the `src/content/artifacts/<slug>.yaml` Phase-2 sync record (per memory `project_artifact_manifests_phase2`). `to_yaml()` emits via a hand-rolled stdlib emitter so the module has no runtime YAML dep. The source repo writes one of these per push; the Mac destination renders `/artifacts/<kind>/` catalog pages from `getCollection('artifacts')`.
+- **`fieldkit.publish.HFHubAdapter`** — lazy-`huggingface_hub` wrapper. Defaults to `dry_run=True` (stages files on disk, logs the would-be calls, no network). Flip `dry_run=False` to push via `HfApi().upload_folder(...)`. Token resolution order: explicit `token=` → `HF_TOKEN` env → cached login. The dry-run path is fully testable offline.
+Plus an orchestrator: **`fieldkit.publish.publish_quant(...)`** — one-line caller that ingests a `QuantReport`-shaped object (duck-typed; produced by `fieldkit.quant.quantize_gguf`), renders the card, writes the manifest, stages the variant files, and pushes (or dry-runs) the HF commit.
+Branded constants: `ORIONFOLD_BRAND = "Orionfold LLC"`, `ORIONFOLD_HF_HANDLE = "Orionfold"` (was `ORIONFOLD_HF_ORG = "orionfoldllc"` until 2026-05-14, when publishing moved to the existing user-account handle — Bartowski-shape personal handle precedent). Per the 2026-05-12 HANDOFF Q3 decision: Orionfold LLC is the parent brand for all AI-artifact publishing surfaces; repo names follow the Bartowski shape (`Orionfold/<model>-GGUF`, `Orionfold/<model>-LoRA`). `ORIONFOLD_HF_ORG` is retained as a back-compat alias pointing at the new constant; will be dropped at the next major cut.
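The token resolution order described for `HFHubAdapter` (explicit `token=`, then the `HF_TOKEN` environment variable, then a cached login) amounts to a short fallback chain. The sketch below is an assumption about the shape of that logic, not the adapter's source; `resolve_token` and its `cached` parameter are hypothetical names, and the cached value is passed in rather than read from disk so the example stays offline.

```python
import os
from typing import Mapping, Optional


def resolve_token(token: Optional[str] = None,
                  env: Mapping[str, str] = os.environ,
                  cached: Optional[str] = None) -> Optional[str]:
    """Pick the first available token: explicit, then HF_TOKEN, then cached."""
    if token:
        return token
    env_token = env.get("HF_TOKEN")
    if env_token:
        return env_token
    # May be None, which a dry run can tolerate since it makes no network calls.
    return cached


explicit = resolve_token("hf_explicit", env={"HF_TOKEN": "hf_env"}, cached="hf_cached")
from_env = resolve_token(None, env={"HF_TOKEN": "hf_env"}, cached="hf_cached")
from_cache = resolve_token(None, env={}, cached="hf_cached")
```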
+### Added — `fieldkit.quant` (new module)
+Quantization dispatcher. GGUF path implemented; AWQ/GPTQ/EXL3/MLX/NVFP4 declared as named stubs pointing at the roadmap.
+- **`fieldkit.quant.quantize_gguf(...)`** — wraps `llama.cpp/convert_hf_to_gguf.py` + `llama-quantize` to emit one GGUF file per requested variant (canonical Orionfold set: `Q4_K_M`, `Q5_K_M`, `Q6_K`, `Q8_0`, `F16`). Auto-derives F16 from a HF Transformers checkpoint when the source isn't already a GGUF. `dry_run=True` enumerates the would-be subprocess commands into `report.notes` without invoking them — used by tests and CI.
+- **`fieldkit.quant.measure_perplexity_gguf(...)`** — wraps `llama-perplexity`. Parses output via `parse_perplexity_output()` which recognizes the standard `Final estimate: PPL = N.NNN` shape and the lowercase `perplexity = N.NNN` fallback. Returns `None` on parse failure (cards ship without a perplexity column if measurement was skipped).
+- **`fieldkit.quant.measure_tokens_per_sec_gguf(...)`** — wraps `llama-bench`. Parses output via `parse_llama_bench_output()` for `tg` (text-gen, default) or `pp` (prompt-process) tok/s.
+- **`fieldkit.quant.ThermalProbe`** — pure-stdlib `nvidia-smi` poll loop. Reports sustained-load minutes before throttle, per the 2026-05-12 HANDOFF Q9 decision to publish duty-cycle limits on every Orionfold card.
+- **`fieldkit.quant.LlamaCppPaths`** — locator for `llama-quantize` / `llama-perplexity` / `llama-bench` / `convert_hf_to_gguf.py`. Env defaults: `LLAMA_CPP_BIN` directory, `LLAMA_CPP_CONVERT` script path. Override any field directly.
+- **`fieldkit.quant.QuantReport`** — canonical dataclass output. The contract `fieldkit.publish.publish_quant()` consumes.
+- **`fieldkit.quant.quantize_awq` / `quantize_gptq` / `quantize_exl3` / `quantize_mlx` / `quantize_nvfp4`** — named entry-point stubs. Raise `NotImplementedError` with a one-liner pointing at `ideas/mtbm-use-cases.md` §7. Locks the v0.4 public surface so v0.5+ implementations slot in without an API break.
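The perplexity parsing described above (a `Final estimate: PPL = N.NNN` line, a lowercase `perplexity = N.NNN` fallback, and `None` on failure) can be illustrated with two regular expressions. This is a sketch under those stated assumptions; `parse_ppl_sketch` is a hypothetical stand-in, not `parse_perplexity_output()` itself, and the sample strings are illustrative.

```python
import re
from typing import Optional

# Tried in order: the standard summary line first, then the lowercase fallback.
_PPL_PATTERNS = (
    re.compile(r"Final estimate:\s*PPL\s*=\s*([0-9]+(?:\.[0-9]+)?)"),
    re.compile(r"perplexity\s*=\s*([0-9]+(?:\.[0-9]+)?)"),
)


def parse_ppl_sketch(output: str) -> Optional[float]:
    """Return the first perplexity value found, or None if nothing parses."""
    for pattern in _PPL_PATTERNS:
        match = pattern.search(output)
        if match:
            return float(match.group(1))
    return None


standard = parse_ppl_sketch("Final estimate: PPL = 6.2210 +/- 0.03")
fallback = parse_ppl_sketch("perplexity = 6.137")
missing = parse_ppl_sketch("run aborted before any estimate was printed")
```

Returning `None` rather than raising matches the changelog's note that a card can ship without a perplexity column when measurement was skipped.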
+### Added — `fieldkit.eval.VerticalBench` (v0.4.x — vertical-curator overlay)
+Lightweight JSONL-loader wrapper around `fieldkit.eval.Bench` for vertical-domain accuracy scoring (FinanceBench / LegalBench / SemEval / generic). Drives the **vertical-curator pivot** announced 2026-05-13 (HANDOFF §2 + `ideas/mtbm-use-cases.md` §6 Pick #1.b + §8.5.1): every Orionfold quant card now ships with a vertical-domain accuracy axis, not just wikitext perplexity. Lives in `fieldkit/src/fieldkit/eval/vertical.py`; re-exported at the package root for `from fieldkit.eval import VerticalBench`.
+- **`fieldkit.eval.VerticalBench`** + **`VerticalQA`** — bench shape, JSONL loader, scorer plumbing. Accepts any `Callable[[str], str]` as the model function so subprocess (`llama-cli`), in-process (`llama-cpp-python`), or NIM-backed scoring all slot in. Per-call latency aggregates alongside accuracy + refusal via the underlying `Bench`.
+- **`fieldkit.eval.VerticalBench.from_jsonl(path, format='auto', ...)`** — auto-detects `financebench` / `legalbench` / `generic` JSONL shapes from the first row's field signature. Per-row metadata (company, doc_period, question_type, task) flows into per-call tags for slice-by aggregation downstream.
+- **Scorers** — `exact_match`, `contains`, `numeric_match` (with configurable `rel_tolerance`, default 1% — FinanceBench convention). The bench picks `numeric_match` by default for FinanceBench-shape JSONL, `exact_match` for LegalBench-shape.
+### Added — license + How-to-run defaults on `fieldkit.publish` (v0.4.x — `Orionfold/finance-chat-GGUF` dry-run found two card bugs)
+- **`ModelCard.license`** is now reachable from `publish_quant(..., model_license=...)` (and the duck-typed `quant_report.model_license` attribute). Previously the kwarg didn't exist and every card defaulted to `apache-2.0` — wrong for any Llama / Gemma / Qwen / CC-BY-NC base. AdaptLLM/finance-chat now correctly publishes with `license: llama2`.
+- **`ArtifactManifest.model_license`** mirrors the same value into the Astro manifest under `license.model:`. Astro Zod schema (`src/content.config.ts`) extended with `license.model: z.string().optional()` so destination catalog pages and HF badges stay in sync. The `license.tier:` field (commercial-distribution tier — `free` / `pro`) stays distinct from this upstream-license field.
+- **`ModelCard.hf_repo`** + **`ModelCard.chat_format`** + **`ModelCard.recommended_variant`** — three new fields that drive an auto-rendered default `## How to run` body. Before this fix, cards with no explicit `ollama_pull_handle` / `transformers_snippet` rendered an empty section header (the second finance-chat bug). The new renderer auto-builds three code blocks templated from `hf_repo` + a featured variant: `huggingface-cli download`, `llama-server` (OpenAI-compatible serve), and `llama-cpp-python` (in-process, threading `chat_format` if set). When all three new fields are absent + no explicit handle/snippet supplied, the section is omitted entirely (no more empty headers).
+- **`publish_quant(..., model_license=, chat_format=, recommended_variant=)`** kwargs added — orchestrate all three through to card + manifest. Same duck-typed fallback through `quant_report` attributes.
+- **`scripts/g3_build_first_quant.sh`** — `MODEL_LICENSE` / `CHAT_FORMAT` / `RECOMMENDED_VARIANT` env knobs added with case-statement overrides (`AdaptLLM/finance-chat → llama2 + llama-2`). Default `MODEL_LICENSE=apache-2.0` + `RECOMMENDED_VARIANT=Q5_K_M` for greenfield runs.
+- **`scripts/g3_push_first_quant.py`** (new) — one-shot live-push helper that reuses the existing dry-run stage (no 32 GB re-copy via `publish_quant(dry_run=False)`); calls `HFHubAdapter.push_folder()` directly. Bakes in xet-safety env (`HF_HOME=/home/nvidia/data/.hf-cache` + `HF_HUB_DISABLE_XET=1`) per the Spark-side `~/.cache/huggingface/` permission landmine; sources `HF_TOKEN` from `.env.local` (chmod 600).
+- **+11 tests** (full suite: 379 passed, 2 skipped offline). Covers: model_license override flow, default apache-2.0 fallback, default GGUF How-to-run rendering, `recommended_variant` override, `hf_repo`-less skip-section behavior, manifest `license.model` emission.
+### Added — vertical-eval surface on `fieldkit.publish`
+`ModelCard` + `ArtifactManifest` + `publish_quant(...)` extended to thread per-variant vertical-eval scores through to the rendered card and the Phase-2 sync manifest:
+- **`ModelCard.vertical_eval: dict[str, float]`** + **`ModelCard.vertical_eval_name: str`** — when set, the **Spark-tested** block renders a 5-column table (Variant / Size / Perplexity / tok/s / *Vertical-eval-name*) instead of the 4-column default, and the introductory copy switches from "measurement triple" to "measurement quad". Accuracy values render as percentages (`62.0%`). Cards without vertical eval render identically to v0.4.0 — backwards-compatible.
+- **`ArtifactManifest.vertical_eval` + `vertical_eval_name`** — written into the YAML manifest under the same key names. Mac destination Zod schema (`src/content.config.ts`) extended to accept both. Manifests without vertical eval skip the field entirely.
+- **`publish_quant(..., vertical_eval=, vertical_eval_name=)`** — explicit kwargs override whatever the duck-typed `quant_report` carries. Useful when scoring happens out-of-band from quantization (the canonical path on Spark: quantize 5 variants → measure each variant via `g3_measure_variants.py`, which calls `VerticalBench.run(llama_cli_fn)` and then feeds the resulting accuracy dict back into `publish_quant`).
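The 4-column versus 5-column switch can be pictured as below. This is a hedged sketch and not `ModelCard.render()`: `render_spark_table`, the `n/a` placeholder, and the sample numbers are assumptions. Only the column names and the percentage formatting (`62.0%`) follow the description above.

```python
from typing import Optional


def render_spark_table(variants: list[dict[str, str]],
                       perplexity: dict[str, float],
                       tok_s: dict[str, float],
                       vertical_eval: Optional[dict[str, float]] = None,
                       vertical_eval_name: Optional[str] = None) -> str:
    """Render a Markdown table; add a 5th column only when scores are given."""
    header = ["Variant", "Size", "Perplexity", "tok/s"]
    if vertical_eval:
        header.append(vertical_eval_name or "Vertical eval")
    lines = ["| " + " | ".join(header) + " |", "|" + "---|" * len(header)]
    for variant in variants:
        name = variant["name"]
        cells = [name, variant["size"],
                 f"{perplexity[name]:.3f}", f"{tok_s[name]:.1f}"]
        if vertical_eval:
            acc = vertical_eval.get(name)
            # Accuracy is stored as a fraction and shown as a percentage.
            cells.append(f"{acc:.1%}" if acc is not None else "n/a")
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)


variants = [{"name": "Q4_K_M", "size": "3.8 GB"}]
plain = render_spark_table(variants, {"Q4_K_M": 6.221}, {"Q4_K_M": 31.1})
scored = render_spark_table(variants, {"Q4_K_M": 6.221}, {"Q4_K_M": 31.1},
                            vertical_eval={"Q4_K_M": 0.62},
                            vertical_eval_name="FinanceBench")
```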
+### Schema changes
+- `src/content.config.ts` — `FIELDKIT_MODULES` extended to include `'quant'` and `'publish'` in canonical order (`capabilities, nim, rag, eval, training, lineage, quant, publish, cli`).
+- `src/content.config.ts` — new `artifacts` Astro collection (Phase 2 sync contract). Loads YAML manifests from `src/content/artifacts/*.yaml`; Zod schema mirrors `fieldkit.publish.ArtifactManifest`. `ARTIFACT_KINDS` enum exposed alongside `FIELDKIT_MODULES` for downstream filtering. `src/content/artifacts/` directory created (empty + `.gitkeep`); first manifest will land when the first quant ships.
+- `src/content.config.ts` — `artifacts` schema extended with optional `vertical_eval: Record<string, number>` + `vertical_eval_name: string` (vertical-curator pivot 2026-05-13).
+### Test suite
+**130 new tests** across `tests/test_publish.py` (42, +16 from v0.4 scaffold incl. +11 for the model_license + How-to-run defaults fix), `tests/test_quant.py` (37), and `tests/test_vertical_bench.py` (39, new file), plus targeted regression coverage. Total: **379 passed, 2 skipped** offline (`pytest -q`). The 2 skips are `--spark`-gated live integration tests (chat NIM + pgvector); the v0.3 torch module-level skip has been resolved by lazy-importing torch only inside the training entry points. All new tests run offline — `dry_run=True` paths for `HFHubAdapter`, `publish_quant`, and `quantize_gguf` exercise the full code path without `huggingface_hub`, llama.cpp binaries, or `nvidia-smi` available. `VerticalBench` tests run without a model — `model_fn` is a callable, so a plain `lambda` exercises the full scoring + bench-aggregation path.
+### Articles in this release
+- [`becoming-a-gguf-publisher-on-spark`](https://ainative.business/field-notes/becoming-a-gguf-publisher-on-spark/) — G3 v0 anchor article. 3,388 words; documents the five-variant `Orionfold/finance-chat-GGUF` release end-to-end (Spark-tested perplexity / tok/s / sustained-load minutes / FinanceBench accuracy across F16, Q8_0, Q6_K, Q5_K_M, Q4_K_M) plus the V0 preflight-bench gate and the V1 chat-vs-continued-pretrain lesson. `hf_url:` frontmatter threads the live HF receipt onto the article.
+### Verified on Spark
+- **Live HF push:** `Orionfold/finance-chat-GGUF` shipped 2026-05-14 at <https://huggingface.co/Orionfold/finance-chat-GGUF> — 5 GGUF variants + auto-rendered README in 1h 57min. Repo returns HTTP 200, all 6 files present. `publish_quant(dry_run=False)` path exercised end-to-end.
+- **Five-variant measurement card** (F16 / Q8_0 / Q6_K / Q5_K_M / Q4_K_M) with the four Spark-tested axes — perplexity (wikitext-2), tg + pp tok/s (`llama-bench`), sustained-load minutes (`ThermalProbe` via `nvidia-smi`), and FinanceBench accuracy (n=50, `numeric_match`, open-book) — all produced via `fieldkit.quant.measure_*` + `fieldkit.eval.VerticalBench.run(...)` on GB10.
+### Deferred to v0.5
+- `fieldkit.image-lora` + `fieldkit.civitai` — Pick #2 (G9) prep. Deferred per the 2026-05-12 HANDOFF Q10 decision to sequence G3 → G9 rather than parallelize. Will land once G3 v0 proves the `fieldkit.publish` infra.
+- Non-GGUF formats in `fieldkit.quant` (AWQ, GPTQ, EXL3, MLX, NVFP4). The G3 v0 niche-positioning is Nemotron-family GGUFs with the Spark-tested layer; other formats are pure surface-area expansion and can wait for an audience signal.
 ## [0.3.0] — 2026-05-11
 Third public release. One new top-level module (`fieldkit.lineage`) lifted from the [auto-research-loop-on-spark article](https://ainative.business/field-notes/auto-research-loop-on-spark/) — the portable part of cxcscmu's *Auto-Research-Recipes* harness, decomposed into a pure-stdlib substrate any harness on the Spark can write into.

{fieldkit-0.3.0 → fieldkit-0.4.1}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: fieldkit
-Version: 0.3.0
+Version: 0.4.1
 Summary: Verified-on-Spark patterns lifted from the ai-field-notes blog into one importable Python package.
 Project-URL: Homepage, https://ainative.business/fieldkit/
 Project-URL: Source, https://github.com/manavsehgal/ai-field-notes/tree/main/fieldkit

{fieldkit-0.3.0 → fieldkit-0.4.1}/docs/api/capabilities.md RENAMED

@@ -80,6 +80,55 @@ practical_inference_envelope("70B params fp8")
 Raises `UnknownEnvelope` if no rule matches.
+### Supporting types
+The `Capabilities` view is composed of three frozen dataclasses. You normally read them off `Capabilities.load()` rather than constructing them directly, but the types are re-exported for type-hinting and structural pattern-matching.
+#### `Hardware`
+```python
+@dataclass(frozen=True, slots=True)
+class Hardware:
+    name: str                                  # "DGX Spark"
+    unified_memory_gb: int                     # 128
+    memory_topology: str                       # "unified CPU+GPU"
+    compute_arch: str                          # "GB10 Grace Blackwell"
+    supported_dtypes: tuple[str, ...]          # ("fp32", "bf16", "fp16", ...)
+    interconnect_to_other_gpus: str
+```
+Reachable as `Capabilities.load().hardware`. Use it to gate code paths on `unified_memory_gb` or `compute_arch` without re-parsing the JSON.
+#### `MemoryBudgetRulesOfThumb`
+```python
+@dataclass(frozen=True, slots=True)
+class MemoryBudgetRulesOfThumb:
+    param_bytes: dict[str, float]                       # mirrors DTYPE_BYTES
+    training_overhead_multiplier: str
+    kv_cache_per_token_per_layer: str
+    practical_inference_envelope: dict[str, str]        # {"8B params bf16": "..."}
+    practical_finetune_envelope: dict[str, str]
+```
+Backs `practical_inference_envelope()`. Inspect `caps.memory_budget_rules_of_thumb.practical_finetune_envelope` directly when you want the fine-tune table instead of the inference one.
+#### `StackEntry`
+```python
+@dataclass(frozen=True, slots=True)
+class StackEntry:
+    id: str                                              # "nim", "nemo", "trt-llm", ...
+    label: str
+    purpose: str
+    verified_in_articles: tuple[str, ...] = ()
+    known_limits: tuple[str, ...] = ()
+    fits_paper_shapes: tuple[str, ...] = ()
+    supported_models_at_spark_scale: tuple[str, ...] = ()
+```
+One entry per Spark-relevant stack component. `frontier-scout` uses `fits_paper_shapes` to decide whether a paper's training recipe matches a stack we have running notes for; the `verified_in_articles` tuple links back into ai-field-notes slugs that proved a given stack on the box.
 ### `DTYPE_BYTES`
 Bytes-per-parameter table:

{fieldkit-0.3.0 → fieldkit-0.4.1}/docs/api/cli.md RENAMED

@@ -2,7 +2,7 @@
 module: cli
 title: fieldkit (CLI)
 summary: A thin Typer wrapper over the modules. Quick checks and smoke benchmarks without writing Python.
-order: 6
+order: 9
 ---
 ## What it is
@@ -17,7 +17,7 @@ Print the installed package version.
 ```bash
 $ fieldkit version
-0.2.0
+0.4.0
 ```
 ### `fieldkit envelope <size>`

{fieldkit-0.3.0 → fieldkit-0.4.1}/docs/api/eval.md RENAMED

@@ -9,6 +9,12 @@ order: 4
 The eval harnesses the project keeps reinventing: a per-call latency benchmarker that emits the same JSON shape as `articles/*/evidence/benchmark.py`, an LLM-as-judge with the three rubrics from `rag-eval-ragas-and-nemo-evaluator`, a trajectory analyzer for agent-loop JSONL, and a refusal regex catalog unioned across the project's articles.
+**v0.4.x additions** (vertical-curator surface for the G3 GGUF publisher pipeline):
+- `VerticalBench` — Spark-overlay scorer for FinanceBench / LegalBench / SemEval-style JSONL test sets. Wraps `Bench`, so latency aggregates alongside accuracy and refusal. Network access lives in the caller (`llama-cli`, NIM, vLLM) — the bench itself is offline-only and unit-testable.
+- `VerticalQA` — one test case (qid + question + expected + tags) lifted from a vertical-eval JSONL.
+- `exact_match` / `contains` / `numeric_match` — the three built-in scorers. `numeric_match` is the FinanceBench default (first-number ±1% rel-tol); `exact_match` is the LegalBench default; `contains` is the right pick when the model answers in prose around a key fact.
 **v0.2 additions** (verifier-loop and agent-bench primitives):
 - `AssertionGrader` — pure file-system grader over five assertion primitives (`file_exists`, `file_not_exists`, `file_contents_contain`, `file_contents_match_regex`, `file_unchanged`). Lifted from `clawgym-on-spark`'s deterministic grader.
@@ -44,6 +50,10 @@ from fieldkit.eval import (
     # v0.2 — matched-base comparison
     MatchedBaseComparison, MatchedBaseComparisonResult, GroupStats,
+    # v0.4.x — vertical-curator surface
+    VerticalBench, VerticalQA,
+    contains, exact_match, numeric_match,
 )
 ```
@@ -218,6 +228,65 @@ json.dump(result.to_dict(), open("comparison.json", "w"), indent=2)
 `MatchedBaseComparison.stats(rows)` is exposed separately when you only need single-rollout aggregation (no comparison). Accepts a list/iterable of dicts or a JSONL path.
+### `VerticalBench(name, questions, scorer=exact_match, ...)` *(v0.4.x)*
+Spark-overlay scorer for vertical-domain test sets — FinanceBench, LegalBench, SemEval-style JSONL — that the G3 GGUF publisher pipeline uses as its fourth measurement axis alongside perplexity, tok/s, and sustained-load minutes.
+The bench is intentionally callable-shaped: it accepts a `model_fn(prompt) -> str` and times each call via the existing `Bench` harness, so latency aggregates alongside accuracy and refusal. Network access lives in the caller (llama-cli, NIM, vLLM), keeping the bench offline-only for unit tests.
+```python
+from fieldkit.eval import VerticalBench, numeric_match
+vb = VerticalBench.from_jsonl(
+    "financebench.jsonl",
+    scorer=numeric_match,         # FinanceBench → first-number ±1%
+    limit=50,
+)
+def model_fn(prompt: str) -> str:
+    return llama_cli_call(gguf_path, prompt)
+bench = vb.run(model_fn, extra_tags={"variant": "Q4_K_M"})
+print(bench.report())             # accuracy + refusal_rate + latency
+```
+`VerticalBench.from_jsonl(path, *, format="auto", limit=None, scorer=None, scorer_kwargs=None)` auto-sniffs FinanceBench / LegalBench / generic schemas from the first JSON row. Rows missing the question or expected field are silently dropped (the row-count delta vs the JSONL is the diagnostic). The default scorer is `numeric_match` for FinanceBench and `exact_match` everywhere else; pass `scorer=` to override.
+`VerticalBench.run(model_fn, *, limit=None, on_error="record", extra_tags=None)` returns the underlying `Bench` so callers route through the existing `.summary()` / `.report()` / `.dump()` pipeline. Each `BenchCall` carries `accuracy` (0.0/1.0 from the scorer) and `refusal` (0.0/1.0 from `is_refusal`) metrics; per-row metadata (company, doc_period, question_type) flows through to `BenchCall.tags` for downstream slice-by aggregation.
+`VerticalBench.summary()` produces a lightweight `{name, n, scorer, tag_keys}` dict without invoking the model — useful in the lineage entry recording *what* the bench will measure before the model has actually run.
+### `VerticalQA` *(v0.4.x)*
+```python
+@dataclass(frozen=True, slots=True)
+class VerticalQA:
+    qid: str                              # FinanceBench `financebench_id`, etc.
+    question: str
+    expected: str
+    tags: dict[str, Any] = field(default_factory=dict)
+```
+One vertical-eval test case. The `qid` is the row's stable id so per-row scores can be cross-referenced against the source JSONL; `tags` carry per-row metadata (company, doc_period, question_type) that flow through to `Bench` for slice-by aggregation downstream.
+### Scorers — `exact_match` / `contains` / `numeric_match` *(v0.4.x)*
+Pluggable `Callable[[predicted, expected], float]` returning 1.0 / 0.0. Pass any custom callable into `VerticalBench(scorer=...)`; the three built-ins cover the dominant patterns:
+```python
+exact_match("yes", "Yes")                          # 1.0 (whitespace- and case-insensitive)
+contains("The 2023 revenue was $4.5B.", "$4.5B")   # 1.0 (substring match)
+numeric_match("Revenue was $4.52B", "4.5B")        # 1.0 (first number; 0.44% off, within ±1% rel-tol)
+numeric_match("Revenue was $4.52B", "4.5B",
+              rel_tolerance=0.001)                 # 0.0 (0.44% exceeds the tighter 0.1% tol)
+```
+| Scorer | When to use it |
+|---|---|
+| `exact_match(p, e)` | LegalBench-style single-label classification (`yes` / `no` / `hold` / `overrule`). Whitespace- and case-insensitive. |
+| `contains(p, e)` | The model is asked to answer in prose and the reference is a key fact/number/phrase that must appear somewhere in the answer. |
+| `numeric_match(p, e, *, rel_tolerance=0.01)` | FinanceBench-style quantitative answers. Extracts the first number from each side (commas stripped), compares under relative tolerance. Defaults to ±1% per FinanceBench's grading convention. Returns 0.0 if either side has no parseable number — including refusals, so the refusal counter elsewhere doesn't need to gate this scorer. |
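The `numeric_match` row can be made concrete with a small stand-alone sketch. It assumes the tolerance is taken relative to the expected value and that a zero reference requires an exact zero; neither detail is confirmed by the docs above. `first_number` and `numeric_match_sketch` are hypothetical names, not the package's functions.

```python
import re
from typing import Optional

_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def first_number(text: str) -> Optional[float]:
    """Return the first number in text, with thousands commas stripped."""
    match = _NUMBER.search(text.replace(",", ""))
    return float(match.group()) if match else None


def numeric_match_sketch(predicted: str, expected: str, *,
                         rel_tolerance: float = 0.01) -> float:
    """1.0 when the first numbers agree within rel_tolerance, else 0.0."""
    p, e = first_number(predicted), first_number(expected)
    if p is None or e is None:
        # No parseable number on one side (for example, a refusal).
        return 0.0
    if e == 0:
        return 1.0 if p == 0 else 0.0
    return 1.0 if abs(p - e) <= rel_tolerance * abs(e) else 0.0


within = numeric_match_sketch("Revenue was $4,520 million", "4500")
tight = numeric_match_sketch("Revenue was $4,520 million", "4500", rel_tolerance=0.001)
refusal = numeric_match_sketch("I cannot answer that.", "4500")
```

One consequence of first-number extraction in a sketch like this: a reply such as "In 2023, revenue was $4.5B" would be scored against 2023, which is one reason a prompt that asks for just the numeric value pairs well with this kind of scorer.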
 ## Samples
 - [`samples/bench-rag.py`](https://github.com/manavsehgal/ai-field-notes/blob/main/fieldkit/samples/bench-rag.py) — offline `Bench` + `Judge.parse` walkthrough.

{fieldkit-0.3.0 → fieldkit-0.4.1}/docs/api/nim.md RENAMED

@@ -62,6 +62,29 @@ chunks = chunk_text(long_doc, max_tokens=900)
 Polls `/models` until 200 or timeout. Returns `True` on success, `False` on timeout. Use it as the first call in any sample script that talks to a cold NIM.
+### `ChatMessage`
+Type alias for the OpenAI-style chat message shape `NIMClient.chat()` consumes:
+```python
+ChatMessage = dict[str, Any]
+# Concretely: {"role": "system" | "user" | "assistant", "content": str | list[...]}
+```
+Exported so callers can type-hint their own helpers that build message arrays without importing `Any` plumbing:
+```python
+from fieldkit.nim import ChatMessage, NIMClient
+def build_rag_prompt(question: str, chunks: list[str]) -> list[ChatMessage]:
+    return [
+        {"role": "system", "content": "Answer from the provided context only."},
+        {"role": "user", "content": "\n\n".join(chunks) + "\n\nQ: " + question},
+    ]
+```
+The alias is intentionally permissive — content may be a string, a list of multimodal parts, or any provider-specific extension. Schema validation is left to the NIM server.
 ### Context-overflow preflight
 `NIMClient.chat()` runs a token-estimate check on its message list and raises `NIMContextOverflowError(estimated_tokens, ceiling)` **before any network call** when the request would exceed `NIM_CONTEXT_WINDOW = 8192`. The opaque NIM 400 from `project_spark_nim_context_window` never surfaces.
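The preflight idea (estimate tokens, compare against the ceiling, raise before any request is sent) can be sketched as follows. The four-characters-per-token heuristic, the `ContextOverflow` class, and the `preflight` helper are assumptions made for illustration; the real estimator and exception live in `fieldkit.nim` and may differ.

```python
from typing import Any

CONTEXT_WINDOW = 8192  # mirrors the documented NIM_CONTEXT_WINDOW value


class ContextOverflow(ValueError):
    """Hypothetical stand-in for the documented overflow error."""

    def __init__(self, estimated_tokens: int, ceiling: int) -> None:
        super().__init__(f"estimated {estimated_tokens} tokens exceeds ceiling {ceiling}")
        self.estimated_tokens = estimated_tokens
        self.ceiling = ceiling


def estimate_tokens(messages: list[dict[str, Any]]) -> int:
    """Rough estimate: about 4 characters per token (an assumption)."""
    chars = sum(len(m["content"]) for m in messages
                if isinstance(m.get("content"), str))
    return chars // 4 + 1


def preflight(messages: list[dict[str, Any]], ceiling: int = CONTEXT_WINDOW) -> int:
    """Raise before any network call if the request looks too large."""
    estimated = estimate_tokens(messages)
    if estimated > ceiling:
        raise ContextOverflow(estimated, ceiling)
    return estimated


ok = preflight([{"role": "user", "content": "hello"}])
try:
    preflight([{"role": "user", "content": "x" * 40_000}])
    overflowed = False
except ContextOverflow as err:
    overflowed = err.estimated_tokens > err.ceiling
```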

fieldkit-0.4.1/docs/api/publish.md ADDED

@@ -0,0 +1,176 @@
+---
+module: publish
+title: fieldkit.publish
+summary: HuggingFace push surface — `ModelCard` (frontmatter + body renderer), `ArtifactManifest` (Phase-2 sync record), `HFHubAdapter` (lazy huggingface_hub wrapper, dry-run by default), `publish_quant` orchestrator. Every Orionfold artifact card carries the same Spark-tested measurement quad (perplexity, tok/s, thermal envelope, optional vertical-eval) — this module is what makes that shape deterministic.
+order: 8
+---
+## What it is
+The publishing side of the Orionfold production line. `fieldkit.quant` produces a `QuantReport`; `fieldkit.publish` turns it into a HuggingFace repo with a deterministic model card and a per-artifact YAML manifest the source repo and destination site both read.
+Three surfaces. `ModelCard` renders the canonical card shape — frontmatter (license, library_name, base_model, tags, model_creator), a `## Spark-tested` block (perplexity + tok/s + thermal envelope + optional vertical-eval table), a `## Variants` table, an auto-generated `## How to run` body (`huggingface-cli download` + `llama-server` + `llama-cpp-python` snippets templated from the HF repo path), an optional `## Lineage` block (rendered from a `fieldkit.lineage.LineageStore` if provided), a `## Methods` backlink to the anchor article, and an Orionfold LLC footer. `ArtifactManifest` is the frozen dataclass for `src/content/artifacts/<slug>.yaml` — the Phase-2 sync record per `project_artifact_manifests_phase2`; the destination renders catalog pages from `getCollection('artifacts')`. `HFHubAdapter` is a lazy wrapper around `huggingface_hub` — defaults to `dry_run=True` (stages files + logs the would-be calls; no network, no token); flip `dry_run=False` to push via `HfApi().upload_folder(...)`.
+The module exists because manual card authoring at MTBM's 3–5-day cadence is the bottleneck. Every quant needs a tags list, a perplexity table, a tok/s number, a thermal envelope note, a lineage backlink — and getting any of those wrong on the customer-facing HF page is a trust hit. `fieldkit.publish` makes the card the deterministic output of the quant+lineage run, not a hand-edit, so the only knobs the operator sets are the ones that genuinely require human judgement (the upstream license, the chat format, the featured variant).
+## Public API
+```python
+from fieldkit.publish import (
+    ARTIFACT_KINDS, ArtifactKind, ArtifactManifest,
+    HFHubAdapter, HFHubNotAvailable, HFAuthError,
+    ModelCard, PublishError, PublishResult,
+    publish_quant, write_artifact_manifest,
+    ORIONFOLD_BRAND, ORIONFOLD_HF_HANDLE, ORIONFOLD_HF_ORG,
+)
+```
+### `ORIONFOLD_BRAND` + `ORIONFOLD_HF_HANDLE`
+```python
+ORIONFOLD_BRAND = "Orionfold LLC"
+ORIONFOLD_HF_HANDLE = "Orionfold"
+```
+The brand stamped on every card footer, and the HuggingFace user handle every repo lands under (`Orionfold/<model>-GGUF`, Bartowski-shape). `ORIONFOLD_HF_ORG` is a back-compat alias for `ORIONFOLD_HF_HANDLE` — kept callable for out-of-tree imports, slated for removal in a future cut.
+### `ARTIFACT_KINDS`
+```python
+ARTIFACT_KINDS = (
+    "quant", "lora", "adapter", "embed",
+    "reranker", "dataset", "space", "bench",
+)
+```
+The manifest `kind` enum. Mirrored by `src/content.config.ts`'s `ARTIFACT_KINDS` so Astro Zod validation and the Python writer stay in lockstep.
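A minimal sketch of how a caller might guard the `kind` value before building a manifest. The tuple is copied from the listing above; `check_kind` is a hypothetical helper for illustration, not a function the package is known to export.

```python
ARTIFACT_KINDS = (
    "quant", "lora", "adapter", "embed",
    "reranker", "dataset", "space", "bench",
)


def check_kind(kind: str) -> str:
    """Hypothetical guard: reject any kind outside the enum tuple."""
    if kind not in ARTIFACT_KINDS:
        raise ValueError(
            f"unknown artifact kind {kind!r}; expected one of {ARTIFACT_KINDS}"
        )
    return kind
```

Failing early on the Python side means a bad `kind` never reaches the Zod validation step in the destination repo.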
+### `ModelCard(...)`
+Frozen dataclass + `render() → str`. Constructed by `publish_quant` from a `QuantReport`-shaped object plus the resolved license / chat_format / recommended_variant triple. Renders to a single `README.md`-style string.
+Key fields:
+```python
+ModelCard(
+    title="finance chat GGUF",
+    one_liner="...",
+    base_model="AdaptLLM/finance-chat",
+    license="llama2",                  # ← HF frontmatter scalar; reflects upstream model's license
+    library_name="gguf",
+    pipeline_tag="text-generation",
+    tags=("gguf", "spark-tested", "orionfold", "base_model:AdaptLLM/finance-chat"),
+    quant_format="gguf",
+    variants=({"name": "Q4_K_M", "size": "3.8 GB", "recommended": "..."}, ...),
+    perplexity={"Q4_K_M": 6.221, "Q8_0": 6.137, ...},
+    tokens_per_sec={"Q4_K_M": 31.1, "Q8_0": 8.9, ...},
+    sustained_load_minutes=2.18,
+    vertical_eval={"Q4_K_M": 0.14, ...},                       # optional 5th column
+    vertical_eval_name="FinanceBench (n=50, numeric_match)",
+    hf_repo="Orionfold/finance-chat-GGUF",                    # drives default `## How to run` body
+    chat_format="llama-2",                                     # → llama_cpp.Llama(chat_format=...)
+    recommended_variant="Q5_K_M",                              # featured in default snippets
+    ollama_pull_handle=None,                                   # opt-in override; default body wins otherwise
+    transformers_snippet=None,
+    lineage_prompt=None,                                       # injected by publish_quant if a LineageStore is supplied
+    article_slug="becoming-a-gguf-publisher-on-spark",
+    article_title="...",
+    model_creator=ORIONFOLD_BRAND,
+)
+```
+`render()` emits sections in canonical order: YAML frontmatter → title + elevator → `## Spark-tested` (omitted if no measurements) → `## Variants` → `## How to run` (auto-rendered defaults when no explicit handle/snippet given; entirely omitted if no defaults templatable) → `## Lineage` (if `lineage_prompt` supplied) → `## Methods` link → footer.
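The ordering-plus-omission rule can be sketched as a join over optional sections. This is an illustrative sketch only: `join_sections` and the section labels are hypothetical names, and the real `render()` may be structured differently.

```python
def join_sections(sections):
    """Hypothetical: keep the given order, drop sections with no body."""
    kept = [body.strip() for _, body in sections if body and body.strip()]
    return "\n\n".join(kept) + "\n"


card = join_sections([
    ("frontmatter", "---\nlicense: llama2\n---"),
    ("title", "# finance chat GGUF"),
    ("spark_tested", None),    # no measurements, so the block is omitted
    ("variants", "## Variants\n(table rows here)"),
    ("how_to_run", ""),        # nothing templatable, so no empty header
])
```

Dropping a section together with its header, rather than emitting the header alone, is the behavior that prevents an empty `## How to run` heading.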
+### `ArtifactManifest(...)`
+Frozen dataclass for `src/content/artifacts/<slug>.yaml`. Flat by design (primitive types + dicts of primitives), so the YAML emitter can be hand-rolled with the standard library alone.
+```python
+m = ArtifactManifest(
+    slug="finance-chat-gguf",
+    kind="quant",
+    artifact_class="gguf",                  # serialized as `class:` in YAML
+    base_model="AdaptLLM/finance-chat",
+    hf_repo="Orionfold/finance-chat-GGUF",
+    variants=("Q4_K_M", "Q5_K_M", "Q6_K", "Q8_0", "F16"),
+    perplexity={"Q4_K_M": 6.221, ...},
+    spark_tokens_per_sec={"Q4_K_M": 31.09, ...},
+    sustained_load_minutes=2.18,
+    vertical_eval={"Q4_K_M": 0.14, ...},
+    vertical_eval_name="FinanceBench (n=50, numeric_match)",
+    lineage_run_id=None,
+    license_tier="free",                    # Orionfold commercial tier (free / pro)
+    license_commercial_tier=None,
+    model_license="llama2",                 # upstream model license (HF frontmatter shape)
+    article="articles/becoming-a-gguf-publisher-on-spark/",
+    civitai_id=None,
+    download_count=None,
+    published_at="2026-05-14T04:46:11Z",
+)
+print(m.to_yaml())
+```
+The `license_tier` / `license_commercial_tier` fields live alongside `model_license` under a nested `license:` block in YAML output. The Mac destination's Zod schema mirrors this shape.
+### `write_artifact_manifest(manifest, *, artifacts_dir)`
+Writes the manifest to `<artifacts_dir>/<slug>.yaml`. Creates the directory if missing. Returns the absolute path of the written file — callers can stage it alongside the article for the next git commit.
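The described behavior (create the directory, write `<slug>.yaml`, return an absolute path) can be approximated with `pathlib`. This is a sketch under assumptions: `write_manifest_text` is a hypothetical stand-in that takes already-rendered YAML text, whereas the real function takes an `ArtifactManifest`.

```python
from pathlib import Path


def write_manifest_text(yaml_text: str, slug: str, artifacts_dir) -> Path:
    """Hypothetical stand-in: write <artifacts_dir>/<slug>.yaml."""
    target_dir = Path(artifacts_dir)
    # Create the directory (and any missing parents) if it does not exist.
    target_dir.mkdir(parents=True, exist_ok=True)
    path = (target_dir / f"{slug}.yaml").resolve()
    path.write_text(yaml_text, encoding="utf-8")
    return path  # absolute, so callers can hand it straight to `git add`
```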
+### `HFHubAdapter(staging_dir, *, dry_run=True, token=None, org=ORIONFOLD_HF_HANDLE)`
+Thin wrapper around `huggingface_hub`. Dry-run by default: lays out the upload set on disk under `staging_dir`, logs the would-be calls. No HF imports required, no token required. Flip `dry_run=False` to push; the lazy import of `huggingface_hub` fires only then.
+```python
+adapter = HFHubAdapter(staging_dir="/tmp/orionfold-stage/finance-chat", dry_run=True)
+adapter.stage_text(card.render(), "README.md")          # stages from a string
+adapter.stage_file(gguf_path, "model-Q4_K_M.gguf")      # stages by copying a file
+result = adapter.push_folder(repo_name="finance-chat-GGUF")
+result.dry_run        # True
+result.files_uploaded  # ('README.md', 'model-Q4_K_M.gguf', ...)
+result.logged_calls   # the upload_folder kwargs that would have fired
+```
+Token resolution order: explicit `token=` arg → `HF_TOKEN` env → `HUGGING_FACE_HUB_TOKEN` env → `huggingface_hub`'s cached login. If all four are absent and `dry_run=False`, `HFAuthError` raises before the network call.
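The resolution order above can be sketched as a first-match lookup. `resolve_token` and its `cached_login` callable are hypothetical names chosen for illustration; the adapter's internal helper may look different.

```python
import os


def resolve_token(explicit=None, environ=None, cached_login=None):
    """Hypothetical: return the first token found in the documented order."""
    env = os.environ if environ is None else environ
    candidates = (
        explicit,                           # 1. explicit token= argument
        env.get("HF_TOKEN"),                # 2. HF_TOKEN env var
        env.get("HUGGING_FACE_HUB_TOKEN"),  # 3. legacy env var
    )
    for candidate in candidates:
        if candidate:
            return candidate
    # 4. Last resort: whatever the cached login yields (may be None).
    return cached_login() if cached_login is not None else None
```

A `None` result with `dry_run=False` corresponds to the case where the text says `HFAuthError` raises before any network call.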
+### `publish_quant(*, quant_report, base_model, repo_name, staging_dir, ...) → PublishResult`
+The one-line orchestrator. Reads the duck-typed `quant_report` fields (`.format`, `.variants`, `.perplexity`, `.tokens_per_sec`, `.sustained_load_minutes`, `.variant_files`, `.vertical_eval`, `.vertical_eval_name`, `.model_license`, `.chat_format`, `.recommended_variant`), builds a `ModelCard`, stages the README + variant files, writes the `ArtifactManifest` (if `artifacts_dir` supplied), and invokes `HFHubAdapter.push_folder()`. Explicit kwargs override duck-typed report attrs.
+```python
+result = publish_quant(
+    quant_report=report,
+    base_model="AdaptLLM/finance-chat",
+    repo_name="finance-chat-GGUF",
+    staging_dir="/tmp/orionfold-stage/finance-chat",
+    artifacts_dir="/home/nvidia/ai-field-notes/src/content/artifacts",
+    article_slug="becoming-a-gguf-publisher-on-spark",
+    article_title="...",
+    vertical_eval={"Q4_K_M": 0.14, "Q5_K_M": 0.16, ...},
+    vertical_eval_name="FinanceBench (n=50, numeric_match)",
+    model_license="llama2",            # critical — never default silently to apache-2.0
+    chat_format="llama-2",
+    recommended_variant="Q5_K_M",
+    lineage_store=store,                # optional; injects ## Lineage block
+    dry_run=True,                       # flip to False for the actual push
+)
+result.hf_repo         # 'Orionfold/finance-chat-GGUF'
+result.card_path       # Path('/tmp/orionfold-stage/.../README.md')
+result.manifest_path   # Path('.../src/content/artifacts/finance-chat-gguf.yaml')
+result.hf_url          # None in dry-run; set after live push
+```
+The `model_license` / `chat_format` / `recommended_variant` kwargs landed in v0.4.x after the `Orionfold/finance-chat-GGUF` dry-run surfaced two card-rendering bugs: a hardcoded `license: apache-2.0` (wrong for the Llama-2 lineage AdaptLLM base) and an empty `## How to run` section (when no ollama handle or transformers snippet was supplied, the section header rendered with no body). Both are now caller-controlled with sane defaults.
+## Why this surface
+Three things to notice. First, `HFHubAdapter` defaults to dry-run because the right workflow is dry-run → human review → live push. Library users who want a one-shot live push pass `dry_run=False` explicitly; library users who want the staging artifact for review (the common case during development) get it for free. The `hf-publisher` skill (`/home/nvidia/.claude/skills/hf-publisher/`) wraps this workflow as a triggered Claude Code surface.
+Second, `publish_quant` duck-types its report rather than importing `fieldkit.quant.QuantReport` directly. This keeps the two modules decoupled (quant doesn't depend on publish; publish doesn't depend on quant, so no circular import can arise) and lets non-quant callers, such as a LoRA pipeline or an embedding pipeline, supply their own report-shaped objects without subclassing.
+Third, `ArtifactManifest` is structurally distinct from `ModelCard` even though they overlap. The card is for HuggingFace; the manifest is for the destination Astro catalog. Both encode the same artifact, but the *consumers* are different and have different schemas. Keeping them as separate dataclasses lets each evolve independently, and lets `write_artifact_manifest` write the manifest even when the HF push is a dry run, which is the state the source repo is committed in during article-only iterations.
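A short sketch of the duck-typing idea: any object with the right attribute names works, and an explicit kwarg takes precedence over the report attribute. The `pick` helper is hypothetical; the attribute names and values are taken from the examples earlier in this document.

```python
from types import SimpleNamespace


def pick(report, name, override=None, default=None):
    """Hypothetical: explicit kwarg wins, then the report attr, then default."""
    if override is not None:
        return override
    return getattr(report, name, default)


# A report-shaped object; no QuantReport import or subclass needed.
report = SimpleNamespace(
    format="gguf",
    variants=("Q4_K_M", "Q5_K_M"),
    perplexity={"Q4_K_M": 6.221},
    model_license="llama2",
)

license_from_report = pick(report, "model_license")
chat_format = pick(report, "chat_format", override="llama-2")
```

Because lookups go through `getattr` with a default, a LoRA or embedding pipeline can leave out attributes it has no value for.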
+## Samples
+- [`scripts/g3_build_first_quant.sh`](https://github.com/manavsehgal/ai-field-notes/blob/main/scripts/g3_build_first_quant.sh) — `publish-dryrun` step assembles a `QuantReport`-shaped `SimpleNamespace` from the measurement JSON and calls `publish_quant(..., dry_run=True)`.
+- [`scripts/g3_push_first_quant.py`](https://github.com/manavsehgal/ai-field-notes/blob/main/scripts/g3_push_first_quant.py) — the live-push one-shot. Reuses the existing dry-run stage; calls `HFHubAdapter(staging_dir=..., dry_run=False).push_folder()` directly so the 32 GB of GGUF bytes don't get re-staged.
+- [`articles/becoming-a-gguf-publisher-on-spark/`](https://ainative.business/field-notes/becoming-a-gguf-publisher-on-spark/) — anchor article. Walks the v0.4.x publish surface end-to-end against `Orionfold/finance-chat-GGUF` and narrates the two bugs that v0.4.0 fixed before tagging.
