fieldkit 0.3.0__tar.gz → 0.4.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (48)
  1. {fieldkit-0.3.0 → fieldkit-0.4.0}/CHANGELOG.md +78 -0
  2. {fieldkit-0.3.0 → fieldkit-0.4.0}/PKG-INFO +1 -1
  3. {fieldkit-0.3.0 → fieldkit-0.4.0}/docs/api/capabilities.md +49 -0
  4. {fieldkit-0.3.0 → fieldkit-0.4.0}/docs/api/cli.md +1 -1
  5. {fieldkit-0.3.0 → fieldkit-0.4.0}/docs/api/eval.md +69 -0
  6. {fieldkit-0.3.0 → fieldkit-0.4.0}/docs/api/nim.md +23 -0
  7. fieldkit-0.4.0/docs/api/publish.md +176 -0
  8. fieldkit-0.4.0/docs/api/quant.md +138 -0
  9. {fieldkit-0.3.0 → fieldkit-0.4.0}/docs/api/rag.md +17 -0
  10. {fieldkit-0.3.0 → fieldkit-0.4.0}/src/fieldkit/_version.py +1 -1
  11. {fieldkit-0.3.0 → fieldkit-0.4.0}/src/fieldkit/eval/__init__.py +17 -0
  12. fieldkit-0.4.0/src/fieldkit/eval/vertical.py +358 -0
  13. fieldkit-0.4.0/src/fieldkit/publish/__init__.py +982 -0
  14. fieldkit-0.4.0/src/fieldkit/quant/__init__.py +568 -0
  15. fieldkit-0.4.0/tests/test_publish.py +807 -0
  16. fieldkit-0.4.0/tests/test_quant.py +314 -0
  17. fieldkit-0.4.0/tests/test_vertical_bench.py +361 -0
  18. {fieldkit-0.3.0 → fieldkit-0.4.0}/.gitignore +0 -0
  19. {fieldkit-0.3.0 → fieldkit-0.4.0}/LICENSE +0 -0
  20. {fieldkit-0.3.0 → fieldkit-0.4.0}/README.md +0 -0
  21. {fieldkit-0.3.0 → fieldkit-0.4.0}/docs/api/lineage.md +0 -0
  22. {fieldkit-0.3.0 → fieldkit-0.4.0}/docs/api/training.md +0 -0
  23. {fieldkit-0.3.0 → fieldkit-0.4.0}/pyproject.toml +0 -0
  24. {fieldkit-0.3.0 → fieldkit-0.4.0}/samples/bench-rag.py +0 -0
  25. {fieldkit-0.3.0 → fieldkit-0.4.0}/samples/feasibility-math.py +0 -0
  26. {fieldkit-0.3.0 → fieldkit-0.4.0}/samples/hello-lineage.py +0 -0
  27. {fieldkit-0.3.0 → fieldkit-0.4.0}/samples/hello-nim.py +0 -0
  28. {fieldkit-0.3.0 → fieldkit-0.4.0}/samples/naive-rag.py +0 -0
  29. {fieldkit-0.3.0 → fieldkit-0.4.0}/src/fieldkit/__init__.py +0 -0
  30. {fieldkit-0.3.0 → fieldkit-0.4.0}/src/fieldkit/capabilities/__init__.py +0 -0
  31. {fieldkit-0.3.0 → fieldkit-0.4.0}/src/fieldkit/capabilities/data/__init__.py +0 -0
  32. {fieldkit-0.3.0 → fieldkit-0.4.0}/src/fieldkit/capabilities/data/spark-capabilities.json +0 -0
  33. {fieldkit-0.3.0 → fieldkit-0.4.0}/src/fieldkit/cli/__init__.py +0 -0
  34. {fieldkit-0.3.0 → fieldkit-0.4.0}/src/fieldkit/lineage/__init__.py +0 -0
  35. {fieldkit-0.3.0 → fieldkit-0.4.0}/src/fieldkit/nim/__init__.py +0 -0
  36. {fieldkit-0.3.0 → fieldkit-0.4.0}/src/fieldkit/rag/__init__.py +0 -0
  37. {fieldkit-0.3.0 → fieldkit-0.4.0}/src/fieldkit/training/__init__.py +0 -0
  38. {fieldkit-0.3.0 → fieldkit-0.4.0}/tests/__init__.py +0 -0
  39. {fieldkit-0.3.0 → fieldkit-0.4.0}/tests/conftest.py +0 -0
  40. {fieldkit-0.3.0 → fieldkit-0.4.0}/tests/test_capabilities.py +0 -0
  41. {fieldkit-0.3.0 → fieldkit-0.4.0}/tests/test_cli.py +0 -0
  42. {fieldkit-0.3.0 → fieldkit-0.4.0}/tests/test_eval.py +0 -0
  43. {fieldkit-0.3.0 → fieldkit-0.4.0}/tests/test_lineage.py +0 -0
  44. {fieldkit-0.3.0 → fieldkit-0.4.0}/tests/test_nim.py +0 -0
  45. {fieldkit-0.3.0 → fieldkit-0.4.0}/tests/test_nim_spark.py +0 -0
  46. {fieldkit-0.3.0 → fieldkit-0.4.0}/tests/test_rag.py +0 -0
  47. {fieldkit-0.3.0 → fieldkit-0.4.0}/tests/test_rag_spark.py +0 -0
  48. {fieldkit-0.3.0 → fieldkit-0.4.0}/tests/test_training.py +0 -0
@@ -6,6 +6,84 @@ The format follows [Keep a Changelog](https://keepachangelog.com/en/1.1.0/), and
6
6
 
7
7
  ## [Unreleased]
8
8
 
9
+ ## [0.4.0] — 2026-05-14
10
+
11
+ Fourth public release. Two new top-level modules (`fieldkit.publish` + `fieldkit.quant`) for the G3 GGUF / Quantization Publisher pick (MTBM Pick #1 per `ideas/mtbm-use-cases.md` §6), the v0.4.x **vertical-curator overlay** on `fieldkit.eval` (`VerticalBench`), and post-dry-run card-rendering fixes that landed the first live HF push (`Orionfold/finance-chat-GGUF`). The two new modules together unlock most of Cluster G; this cut implements the GGUF critical path and stubs the other quant formats with named entry points pointing at the v0.5+ roadmap.
12
+
13
+ ### Added — `fieldkit.publish` (new module)
14
+
15
+ HuggingFace Hub adapter + auto model card builder from `fieldkit.lineage`. Three public surfaces:
16
+
17
+ - **`fieldkit.publish.ModelCard`** — frontmatter + body builder. Renders the canonical card every Orionfold artifact gets: YAML frontmatter (license, library_name, base_model, pipeline_tag, tags, model_creator), a title + elevator, a **Spark-tested** block (per-variant perplexity + tok/s + thermal envelope), a variants table, **How to run** (`ollama pull` + `from_pretrained` snippets), an optional **Lineage** block (rendered from a `fieldkit.lineage.LineageStore` if provided), a **Methods** backlink to `ainative.business/field-notes/<slug>/`, and a footer attributing the publication to Orionfold LLC.
18
+ - **`fieldkit.publish.ArtifactManifest`** — frozen dataclass for the `src/content/artifacts/<slug>.yaml` Phase-2 sync record (per memory `project_artifact_manifests_phase2`). `to_yaml()` emits via a hand-rolled stdlib emitter so the module has no runtime YAML dep. The source repo writes one of these per push; the Mac destination renders `/artifacts/<kind>/` catalog pages from `getCollection('artifacts')`.
19
+ - **`fieldkit.publish.HFHubAdapter`** — lazy-`huggingface_hub` wrapper. Defaults to `dry_run=True` (stages files on disk, logs the would-be calls, no network). Flip `dry_run=False` to push via `HfApi().upload_folder(...)`. Token resolution order: explicit `token=` → `HF_TOKEN` env → cached login. The dry-run path is fully testable offline.
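The token resolution order described for `HFHubAdapter` can be sketched in plain Python. This is an illustrative stand-in, not the module's code: `resolve_token` and its `cached_lookup` parameter are hypothetical names, and only the `HF_TOKEN` variable and the explicit, then env, then cached-login order come from the changelog entry.

```python
import os
from typing import Callable, Mapping, Optional


def resolve_token(
    explicit: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
    cached_lookup: Callable[[], Optional[str]] = lambda: None,
) -> Optional[str]:
    """Return the first available token: explicit arg, HF_TOKEN, then cache.

    Hypothetical helper. `cached_lookup` stands in for whatever reads a
    previously cached login; it is an assumption, not a fieldkit API.
    """
    if explicit:
        return explicit
    # Allow an injected mapping so the lookup is testable without os.environ.
    env = os.environ if env is None else env
    from_env = env.get("HF_TOKEN")
    if from_env:
        return from_env
    return cached_lookup()
```

Injecting `env` and `cached_lookup` keeps the sketch testable offline, in the same spirit as the dry-run path.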
20
+
21
+ Plus an orchestrator: **`fieldkit.publish.publish_quant(...)`** — one-line caller that ingests a `QuantReport`-shaped object (duck-typed; produced by `fieldkit.quant.quantize_gguf`), renders the card, writes the manifest, stages the variant files, and pushes (or dry-runs) the HF commit.
22
+
23
+ Branded constants: `ORIONFOLD_BRAND = "Orionfold LLC"`, `ORIONFOLD_HF_HANDLE = "Orionfold"` (was `ORIONFOLD_HF_ORG = "orionfoldllc"` until 2026-05-14, when publishing moved to the existing user-account handle — Bartowski-shape personal handle precedent). Per the 2026-05-12 HANDOFF Q3 decision: Orionfold LLC is the parent brand for all AI-artifact publishing surfaces; repo names follow the Bartowski shape (`Orionfold/<model>-GGUF`, `Orionfold/<model>-LoRA`). `ORIONFOLD_HF_ORG` is retained as a back-compat alias pointing at the new constant; will be dropped at the next major cut.
24
+
25
+ ### Added — `fieldkit.quant` (new module)
26
+
27
+ Quantization dispatcher. GGUF path implemented; AWQ/GPTQ/EXL3/MLX/NVFP4 declared as named stubs pointing at the roadmap.
28
+
29
+ - **`fieldkit.quant.quantize_gguf(...)`** — wraps `llama.cpp/convert_hf_to_gguf.py` + `llama-quantize` to emit one GGUF file per requested variant (canonical Orionfold set: `Q4_K_M`, `Q5_K_M`, `Q6_K`, `Q8_0`, `F16`). Auto-derives F16 from a HF Transformers checkpoint when the source isn't already a GGUF. `dry_run=True` enumerates the would-be subprocess commands into `report.notes` without invoking them — used by tests and CI.
30
+ - **`fieldkit.quant.measure_perplexity_gguf(...)`** — wraps `llama-perplexity`. Parses output via `parse_perplexity_output()` which recognizes the standard `Final estimate: PPL = N.NNN` shape and the lowercase `perplexity = N.NNN` fallback. Returns `None` on parse failure (cards ship without a perplexity column if measurement was skipped).
31
+ - **`fieldkit.quant.measure_tokens_per_sec_gguf(...)`** — wraps `llama-bench`. Parses output via `parse_llama_bench_output()` for `tg` (text-gen, default) or `pp` (prompt-process) tok/s.
32
+ - **`fieldkit.quant.ThermalProbe`** — pure-stdlib `nvidia-smi` poll loop. Reports sustained-load minutes before throttle, per the 2026-05-12 HANDOFF Q9 decision to publish duty-cycle limits on every Orionfold card.
33
+ - **`fieldkit.quant.LlamaCppPaths`** — locator for `llama-quantize` / `llama-perplexity` / `llama-bench` / `convert_hf_to_gguf.py`. Env defaults: `LLAMA_CPP_BIN` directory, `LLAMA_CPP_CONVERT` script path. Override any field directly.
34
+ - **`fieldkit.quant.QuantReport`** — canonical dataclass output. The contract `fieldkit.publish.publish_quant()` consumes.
35
+ - **`fieldkit.quant.quantize_awq` / `quantize_gptq` / `quantize_exl3` / `quantize_mlx` / `quantize_nvfp4`** — named entry-point stubs. Raise `NotImplementedError` with a one-liner pointing at `ideas/mtbm-use-cases.md` §7. Locks the v0.4 public surface so v0.5+ implementations slot in without an API break.
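The two output shapes that `parse_perplexity_output()` is described as recognizing can be illustrated with a small regex-based sketch. The function name `parse_perplexity_sketch` and the regular expressions are assumptions made for illustration; the module's actual patterns may differ.

```python
import re
from typing import Optional

# Patterns follow the two shapes named in the changelog; the regexes
# themselves are an assumption, not the module's implementation.
_FINAL_ESTIMATE = re.compile(r"Final estimate:\s*PPL\s*=\s*([0-9]+(?:\.[0-9]+)?)")
_LOWERCASE = re.compile(r"perplexity\s*=\s*([0-9]+(?:\.[0-9]+)?)")


def parse_perplexity_sketch(output: str) -> Optional[float]:
    """Return the perplexity found in `output`, or None if nothing parses."""
    # Try the standard shape first, then the lowercase fallback.
    for pattern in (_FINAL_ESTIMATE, _LOWERCASE):
        match = pattern.search(output)
        if match:
            return float(match.group(1))
    return None
```

Returning `None` rather than raising matches the described behavior, where a card can ship without a perplexity column when measurement was skipped.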
36
+
37
+ ### Added — `fieldkit.eval.VerticalBench` (v0.4.x — vertical-curator overlay)
38
+
39
+ Lightweight JSONL-loader wrapper around `fieldkit.eval.Bench` for vertical-domain accuracy scoring (FinanceBench / LegalBench / SemEval / generic). Drives the **vertical-curator pivot** announced 2026-05-13 (HANDOFF §2 + `ideas/mtbm-use-cases.md` §6 Pick #1.b + §8.5.1): every Orionfold quant card now ships with a vertical-domain accuracy axis, not just wikitext perplexity. Lives in `fieldkit/src/fieldkit/eval/vertical.py`; re-exported at the package root for `from fieldkit.eval import VerticalBench`.
40
+
41
+ - **`fieldkit.eval.VerticalBench`** + **`VerticalQA`** — bench shape, JSONL loader, scorer plumbing. Accepts any `Callable[[str], str]` as the model function so subprocess (`llama-cli`), in-process (`llama-cpp-python`), or NIM-backed scoring all slot in. Per-call latency aggregates alongside accuracy + refusal via the underlying `Bench`.
42
+ - **`fieldkit.eval.VerticalBench.from_jsonl(path, format='auto', ...)`** — auto-detects `financebench` / `legalbench` / `generic` JSONL shapes from the first row's field signature. Per-row metadata (company, doc_period, question_type, task) flows into per-call tags for slice-by aggregation downstream.
43
+ - **Scorers** — `exact_match`, `contains`, `numeric_match` (with configurable `rel_tolerance`, default 1% — FinanceBench convention). The bench picks `numeric_match` by default for FinanceBench-shape JSONL, `exact_match` for LegalBench-shape.
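Format auto-detection from the first row's field signature might look roughly like the sketch below. The key names are assumptions: `financebench_id` appears in the `VerticalQA` docs, while the `text` + `answer` signature for LegalBench-shaped rows and the fallback to `generic` are guesses made for illustration. `sniff_format` and `sniff_jsonl` are hypothetical names.

```python
import json
from typing import Any, Iterable


def sniff_format(first_row: dict[str, Any]) -> str:
    """Classify one parsed JSONL row by its field names (assumed signatures)."""
    keys = set(first_row)
    if "financebench_id" in keys:
        return "financebench"
    # Assumed LegalBench-like shape: a `text` prompt plus an `answer` label.
    if {"text", "answer"} <= keys and "question" not in keys:
        return "legalbench"
    return "generic"


def sniff_jsonl(lines: Iterable[str]) -> str:
    """Sniff the format from the first non-blank line of a JSONL stream."""
    for line in lines:
        if line.strip():
            return sniff_format(json.loads(line))
    return "generic"
```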
44
+
45
+ ### Added — license + How-to-run defaults on `fieldkit.publish` (v0.4.x — `Orionfold/finance-chat-GGUF` dry-run found two card bugs)
46
+
47
+ - **`ModelCard.license`** is now reachable from `publish_quant(..., model_license=...)` (and the duck-typed `quant_report.model_license` attribute). Previously the kwarg didn't exist and every card defaulted to `apache-2.0` — wrong for any Llama / Gemma / Qwen / CC-BY-NC base. AdaptLLM/finance-chat now correctly publishes with `license: llama2`.
48
+ - **`ArtifactManifest.model_license`** mirrors the same value into the Astro manifest under `license.model:`. Astro Zod schema (`src/content.config.ts`) extended with `license.model: z.string().optional()` so destination catalog pages and HF badges stay in sync. The `license.tier:` field (commercial-distribution tier — `free` / `pro`) stays distinct from this upstream-license field.
49
+ - **`ModelCard.hf_repo`** + **`ModelCard.chat_format`** + **`ModelCard.recommended_variant`** — three new fields that drive an auto-rendered default `## How to run` body. Before this fix, cards with no explicit `ollama_pull_handle` / `transformers_snippet` rendered an empty section header (the second finance-chat bug). The new renderer auto-builds three code blocks templated from `hf_repo` + a featured variant: `huggingface-cli download`, `llama-server` (OpenAI-compatible serve), and `llama-cpp-python` (in-process, threading `chat_format` if set). When all three new fields are absent + no explicit handle/snippet supplied, the section is omitted entirely (no more empty headers).
50
+ - **`publish_quant(..., model_license=, chat_format=, recommended_variant=)`** kwargs added — orchestrate all three through to card + manifest. Same duck-typed fallback through `quant_report` attributes.
51
+ - **`scripts/g3_build_first_quant.sh`** — `MODEL_LICENSE` / `CHAT_FORMAT` / `RECOMMENDED_VARIANT` env knobs added with case-statement overrides (`AdaptLLM/finance-chat → llama2 + llama-2`). Default `MODEL_LICENSE=apache-2.0` + `RECOMMENDED_VARIANT=Q5_K_M` for greenfield runs.
52
+ - **`scripts/g3_push_first_quant.py`** (new) — one-shot live-push helper that reuses the existing dry-run stage (no 32 GB re-copy via `publish_quant(dry_run=False)`); calls `HFHubAdapter.push_folder()` directly. Bakes in xet-safety env (`HF_HOME=/home/nvidia/data/.hf-cache` + `HF_HUB_DISABLE_XET=1`) per the Spark-side `~/.cache/huggingface/` permission landmine; sources `HF_TOKEN` from `.env.local` (chmod 600).
53
+ - **+11 tests** (full suite: 379 passed, 2 skipped offline). Covers: model_license override flow, default apache-2.0 fallback, default GGUF How-to-run rendering, `recommended_variant` override, `hf_repo`-less skip-section behavior, manifest `license.model` emission.
54
+
55
+ ### Added — vertical-eval surface on `fieldkit.publish`
56
+
57
+ `ModelCard` + `ArtifactManifest` + `publish_quant(...)` extended to thread per-variant vertical-eval scores through to the rendered card and the Phase-2 sync manifest:
58
+
59
+ - **`ModelCard.vertical_eval: dict[str, float]`** + **`ModelCard.vertical_eval_name: str`** — when set, the **Spark-tested** block renders a 5-column table (Variant / Size / Perplexity / tok/s / *Vertical-eval-name*) instead of the 4-column default, and the introductory copy switches from "measurement triple" to "measurement quad". Accuracy values render as percentages (`62.0%`). Cards without vertical eval render identically to v0.4.0 — backwards-compatible.
60
+ - **`ArtifactManifest.vertical_eval` + `vertical_eval_name`** — written into the YAML manifest under the same key names. Mac destination Zod schema (`src/content.config.ts`) extended to accept both. Manifests without vertical eval skip the field entirely.
61
+ - **`publish_quant(..., vertical_eval=, vertical_eval_name=)`** — explicit kwargs override whatever the duck-typed `quant_report` carries. Useful when scoring happens out-of-band from quantization (the canonical path on Spark: quantize 5 variants → measure each variant via `g3_measure_variants.py`, which calls `VerticalBench.run(llama_cli_fn)` and then feeds the resulting accuracy dict back into `publish_quant`).
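The switch between the 4-column and 5-column **Spark-tested** table can be sketched as follows. `render_spark_table` is a hypothetical helper written for illustration. The column order and the percentage formatting follow the entry above, while the `n/a` placeholder for missing values is an assumption.

```python
from typing import Mapping, Optional, Sequence


def render_spark_table(
    variants: Sequence[str],
    sizes: Mapping[str, str],
    perplexity: Mapping[str, float],
    tokens_per_sec: Mapping[str, float],
    vertical_eval: Optional[Mapping[str, float]] = None,
    vertical_eval_name: str = "Vertical eval",
) -> str:
    """Render a Markdown table; add a fifth column only when scores exist."""
    header = ["Variant", "Size", "Perplexity", "tok/s"]
    if vertical_eval:
        header.append(vertical_eval_name)
    rows = ["| " + " | ".join(header) + " |", "|" + "---|" * len(header)]
    for name in variants:
        cells = [
            name,
            sizes.get(name, "n/a"),
            f"{perplexity[name]:.3f}" if name in perplexity else "n/a",
            f"{tokens_per_sec[name]:.1f}" if name in tokens_per_sec else "n/a",
        ]
        if vertical_eval:
            score = vertical_eval.get(name)
            # Accuracy is stored as a fraction and shown as a percentage.
            cells.append(f"{score * 100:.1f}%" if score is not None else "n/a")
        rows.append("| " + " | ".join(cells) + " |")
    return "\n".join(rows)
```

Leaving `vertical_eval` unset produces exactly the four-column layout, which is how the backwards-compatible rendering falls out of a single code path in this sketch.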
62
+
63
+ ### Schema changes
64
+
65
+ - `src/content.config.ts` — `FIELDKIT_MODULES` extended to include `'quant'` and `'publish'` in canonical order (`capabilities, nim, rag, eval, training, lineage, quant, publish, cli`).
66
+ - `src/content.config.ts` — new `artifacts` Astro collection (Phase 2 sync contract). Loads YAML manifests from `src/content/artifacts/*.yaml`; Zod schema mirrors `fieldkit.publish.ArtifactManifest`. `ARTIFACT_KINDS` enum exposed alongside `FIELDKIT_MODULES` for downstream filtering. `src/content/artifacts/` directory created (empty + `.gitkeep`); first manifest will land when the first quant ships.
67
+ - `src/content.config.ts` — `artifacts` schema extended with optional `vertical_eval: Record<string, number>` + `vertical_eval_name: string` (vertical-curator pivot 2026-05-13).
68
+
69
+ ### Test suite
70
+
71
+ **130 new tests** across `tests/test_publish.py` (42, +16 from v0.4 scaffold incl. +11 for the model_license + How-to-run defaults fix), `tests/test_quant.py` (37), and `tests/test_vertical_bench.py` (39, new file), plus targeted regression coverage. Total: **379 passed, 2 skipped** offline (`pytest -q`). The 2 skips are `--spark`-gated live integration tests (chat NIM + pgvector); the v0.3 torch module-level skip has been resolved by lazy-importing torch only inside the training entry points. All new tests run offline — `dry_run=True` paths for `HFHubAdapter`, `publish_quant`, and `quantize_gguf` exercise the full code path without `huggingface_hub`, llama.cpp binaries, or `nvidia-smi` available. `VerticalBench` tests run without a model — `model_fn` is a callable, so a plain `lambda` exercises the full scoring + bench-aggregation path.
72
+
73
+ ### Articles in this release
74
+
75
+ - [`becoming-a-gguf-publisher-on-spark`](https://ainative.business/field-notes/becoming-a-gguf-publisher-on-spark/) — G3 v0 anchor article. 3,388 words; documents the five-variant `Orionfold/finance-chat-GGUF` release end-to-end (Spark-tested perplexity / tok/s / sustained-load minutes / FinanceBench accuracy across F16, Q8_0, Q6_K, Q5_K_M, Q4_K_M) plus the V0 preflight-bench gate and the V1 chat-vs-continued-pretrain lesson. `hf_url:` frontmatter threads the live HF receipt onto the article.
76
+
77
+ ### Verified on Spark
78
+
79
+ - **Live HF push:** `Orionfold/finance-chat-GGUF` shipped 2026-05-14 at <https://huggingface.co/Orionfold/finance-chat-GGUF> — 5 GGUF variants + auto-rendered README in 1h 57min. Repo returns HTTP 200, all 6 files present. `publish_quant(dry_run=False)` path exercised end-to-end.
80
+ - **Five-variant measurement card** (F16 / Q8_0 / Q6_K / Q5_K_M / Q4_K_M) with the four Spark-tested axes — perplexity (wikitext-2), tg + pp tok/s (`llama-bench`), sustained-load minutes (`ThermalProbe` via `nvidia-smi`), and FinanceBench accuracy (n=50, `numeric_match`, open-book) — all produced via `fieldkit.quant.measure_*` + `fieldkit.eval.VerticalBench.run(...)` on GB10.
81
+
82
+ ### Deferred to v0.5
83
+
84
+ - `fieldkit.image-lora` + `fieldkit.civitai` — Pick #2 (G9) prep. Deferred per the 2026-05-12 HANDOFF Q10 decision to sequence G3 → G9 rather than parallelize. Will land once G3 v0 proves the `fieldkit.publish` infra.
85
+ - Non-GGUF formats in `fieldkit.quant` (AWQ, GPTQ, EXL3, MLX, NVFP4). The G3 v0 niche-positioning is Nemotron-family GGUFs with the Spark-tested layer; other formats are pure surface-area expansion and can wait for an audience signal.
86
+
9
87
  ## [0.3.0] — 2026-05-11
10
88
 
11
89
  Third public release. One new top-level module (`fieldkit.lineage`) lifted from the [auto-research-loop-on-spark article](https://ainative.business/field-notes/auto-research-loop-on-spark/) — the portable part of cxcscmu's *Auto-Research-Recipes* harness, decomposed into a pure-stdlib substrate any harness on the Spark can write into.
@@ -1,6 +1,6 @@
1
1
  Metadata-Version: 2.4
2
2
  Name: fieldkit
3
- Version: 0.3.0
3
+ Version: 0.4.0
4
4
  Summary: Verified-on-Spark patterns lifted from the ai-field-notes blog into one importable Python package.
5
5
  Project-URL: Homepage, https://ainative.business/fieldkit/
6
6
  Project-URL: Source, https://github.com/manavsehgal/ai-field-notes/tree/main/fieldkit
@@ -80,6 +80,55 @@ practical_inference_envelope("70B params fp8")
80
80
 
81
81
  Raises `UnknownEnvelope` if no rule matches.
82
82
 
83
+ ### Supporting types
84
+
85
+ The `Capabilities` view is composed of three frozen dataclasses. You normally read them off `Capabilities.load()` rather than constructing them directly, but the types are re-exported for type-hinting and structural pattern-matching.
86
+
87
+ #### `Hardware`
88
+
89
+ ```python
90
+ @dataclass(frozen=True, slots=True)
91
+ class Hardware:
92
+     name: str  # "DGX Spark"
93
+     unified_memory_gb: int  # 128
94
+     memory_topology: str  # "unified CPU+GPU"
95
+     compute_arch: str  # "GB10 Grace Blackwell"
96
+     supported_dtypes: tuple[str, ...]  # ("fp32", "bf16", "fp16", ...)
97
+     interconnect_to_other_gpus: str
98
+ ```
99
+
100
+ Reachable as `Capabilities.load().hardware`. Use it to gate code paths on `unified_memory_gb` or `compute_arch` without re-parsing the JSON.
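Gating a code path on `unified_memory_gb` might look like the sketch below. It uses a trimmed stand-in for `Hardware` so it runs without fieldkit installed. The `weights_fit` helper, the 80% headroom figure, and the placeholder interconnect value are all assumptions for illustration, not part of the package.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hardware:
    """Stand-in mirroring the documented fields (not the real class)."""
    name: str
    unified_memory_gb: int
    memory_topology: str
    compute_arch: str
    supported_dtypes: tuple
    interconnect_to_other_gpus: str


def weights_fit(hw: Hardware, n_params: float, bytes_per_param: float,
                headroom: float = 0.8) -> bool:
    """True if raw weights fit in `headroom` of unified memory.

    Counts weights only; KV cache and activations are ignored, and the
    0.8 headroom is an arbitrary illustrative choice.
    """
    weights_gb = n_params * bytes_per_param / 1e9
    return weights_gb <= hw.unified_memory_gb * headroom


spark = Hardware(
    name="DGX Spark",
    unified_memory_gb=128,
    memory_topology="unified CPU+GPU",
    compute_arch="GB10 Grace Blackwell",
    supported_dtypes=("fp32", "bf16", "fp16"),
    interconnect_to_other_gpus="n/a (placeholder)",
)
```

With the real package, the same check would read the instance from `Capabilities.load().hardware` instead of constructing one by hand.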
101
+
102
+ #### `MemoryBudgetRulesOfThumb`
103
+
104
+ ```python
105
+ @dataclass(frozen=True, slots=True)
106
+ class MemoryBudgetRulesOfThumb:
107
+     param_bytes: dict[str, float]  # mirrors DTYPE_BYTES
108
+     training_overhead_multiplier: str
109
+     kv_cache_per_token_per_layer: str
110
+     practical_inference_envelope: dict[str, str]  # {"8B params bf16": "..."}
111
+     practical_finetune_envelope: dict[str, str]
112
+ ```
113
+
114
+ Backs `practical_inference_envelope()`. Inspect `caps.memory_budget_rules_of_thumb.practical_finetune_envelope` directly when you want the fine-tune table instead of the inference one.
115
+
116
+ #### `StackEntry`
117
+
118
+ ```python
119
+ @dataclass(frozen=True, slots=True)
120
+ class StackEntry:
121
+     id: str  # "nim", "nemo", "trt-llm", ...
122
+     label: str
123
+     purpose: str
124
+     verified_in_articles: tuple[str, ...] = ()
125
+     known_limits: tuple[str, ...] = ()
126
+     fits_paper_shapes: tuple[str, ...] = ()
127
+     supported_models_at_spark_scale: tuple[str, ...] = ()
128
+ ```
129
+
130
+ One entry per Spark-relevant stack component. `frontier-scout` uses `fits_paper_shapes` to decide whether a paper's training recipe matches a stack we have running notes for; the `verified_in_articles` tuple links back into ai-field-notes slugs that proved a given stack on the box.
131
+
83
132
  ### `DTYPE_BYTES`
84
133
 
85
134
  Bytes-per-parameter table:
@@ -2,7 +2,7 @@
2
2
  module: cli
3
3
  title: fieldkit (CLI)
4
4
  summary: A thin Typer wrapper over the modules. Quick checks and smoke benchmarks without writing Python.
5
- order: 6
5
+ order: 7
6
6
  ---
7
7
 
8
8
  ## What it is
@@ -9,6 +9,12 @@ order: 4
9
9
 
10
10
  The eval harnesses the project keeps reinventing: a per-call latency benchmarker that emits the same JSON shape as `articles/*/evidence/benchmark.py`, an LLM-as-judge with the three rubrics from `rag-eval-ragas-and-nemo-evaluator`, a trajectory analyzer for agent-loop JSONL, and a refusal regex catalog unioned across the project's articles.
11
11
 
12
+ **v0.4.x additions** (vertical-curator surface for the G3 GGUF publisher pipeline):
13
+
14
+ - `VerticalBench` — Spark-overlay scorer for FinanceBench / LegalBench / SemEval-style JSONL test sets. Wraps `Bench`, so latency aggregates alongside accuracy and refusal. Network access lives in the caller (`llama-cli`, NIM, vLLM) — the bench itself is offline-only and unit-testable.
15
+ - `VerticalQA` — one test case (qid + question + expected + tags) lifted from a vertical-eval JSONL.
16
+ - `exact_match` / `contains` / `numeric_match` — the three built-in scorers. `numeric_match` is the FinanceBench default (first-number ±1% rel-tol); `exact_match` is the LegalBench default; `contains` is the right pick when the model answers in prose around a key fact.
17
+
12
18
  **v0.2 additions** (verifier-loop and agent-bench primitives):
13
19
 
14
20
  - `AssertionGrader` — pure file-system grader over five assertion primitives (`file_exists`, `file_not_exists`, `file_contents_contain`, `file_contents_match_regex`, `file_unchanged`). Lifted from `clawgym-on-spark`'s deterministic grader.
@@ -44,6 +50,10 @@ from fieldkit.eval import (
44
50
 
45
51
      # v0.2 — matched-base comparison
46
52
      MatchedBaseComparison, MatchedBaseComparisonResult, GroupStats,
53
+
54
+     # v0.4.x — vertical-curator surface
55
+     VerticalBench, VerticalQA,
56
+     contains, exact_match, numeric_match,
47
57
  )
48
58
  ```
49
59
 
@@ -218,6 +228,65 @@ json.dump(result.to_dict(), open("comparison.json", "w"), indent=2)
218
228
 
219
229
  `MatchedBaseComparison.stats(rows)` is exposed separately when you only need single-rollout aggregation (no comparison). Accepts a list/iterable of dicts or a JSONL path.
220
230
 
231
+ ### `VerticalBench(name, questions, scorer=exact_match, ...)` *(v0.4.x)*
232
+
233
+ Spark-overlay scorer for vertical-domain test sets — FinanceBench, LegalBench, SemEval-style JSONL — that the G3 GGUF publisher pipeline uses as its fourth measurement axis alongside perplexity, tok/s, and sustained-load minutes.
234
+
235
+ The bench is intentionally callable-shaped: it accepts a `model_fn(prompt) -> str` and times each call via the existing `Bench` harness, so latency aggregates alongside accuracy and refusal. Network access lives in the caller (llama-cli, NIM, vLLM), keeping the bench offline-only for unit tests.
236
+
237
+ ```python
238
+ from fieldkit.eval import VerticalBench, numeric_match
239
+
240
+ vb = VerticalBench.from_jsonl(
241
+     "financebench.jsonl",
242
+     scorer=numeric_match,  # FinanceBench → first-number ±1%
243
+     limit=50,
244
+ )
245
+
246
+ def model_fn(prompt: str) -> str:
247
+     return llama_cli_call(gguf_path, prompt)
248
+
249
+ bench = vb.run(model_fn, extra_tags={"variant": "Q4_K_M"})
250
+ print(bench.report()) # accuracy + refusal_rate + latency
251
+ ```
252
+
253
+ `VerticalBench.from_jsonl(path, *, format="auto", limit=None, scorer=None, scorer_kwargs=None)` auto-sniffs FinanceBench / LegalBench / generic schemas from the first JSON row. Rows missing the question or expected field are silently dropped (the row-count delta vs the JSONL is the diagnostic). The default scorer is `numeric_match` for FinanceBench and `exact_match` everywhere else; pass `scorer=` to override.
254
+
255
+ `VerticalBench.run(model_fn, *, limit=None, on_error="record", extra_tags=None)` returns the underlying `Bench` so callers route through the existing `.summary()` / `.report()` / `.dump()` pipeline. Each `BenchCall` carries `accuracy` (0.0/1.0 from the scorer) and `refusal` (0.0/1.0 from `is_refusal`) metrics; per-row metadata (company, doc_period, question_type) flows through to `BenchCall.tags` for downstream slice-by aggregation.
256
+
257
+ `VerticalBench.summary()` produces a lightweight `{name, n, scorer, tag_keys}` dict without invoking the model — useful in the lineage entry recording *what* the bench will measure before the model has actually run.
258
+
259
+ ### `VerticalQA` *(v0.4.x)*
260
+
261
+ ```python
262
+ @dataclass(frozen=True, slots=True)
263
+ class VerticalQA:
264
+     qid: str  # FinanceBench `financebench_id`, etc.
265
+     question: str
266
+     expected: str
267
+     tags: dict[str, Any] = field(default_factory=dict)
268
+ ```
269
+
270
+ One vertical-eval test case. The `qid` is the row's stable id so per-row scores can be cross-referenced against the source JSONL; `tags` carry per-row metadata (company, doc_period, question_type) that flow through to `Bench` for slice-by aggregation downstream.
271
+
272
+ ### Scorers — `exact_match` / `contains` / `numeric_match` *(v0.4.x)*
273
+
274
+ Pluggable `Callable[[predicted, expected], float]` returning 1.0 / 0.0. Pass any custom callable into `VerticalBench(scorer=...)`; the three built-ins cover the dominant patterns:
275
+
276
+ ```python
277
+ exact_match("yes", "Yes") # 1.0 — whitespace + case-insensitive
278
+ contains("The 2023 revenue was $4.5B.", "$4.5B") # 1.0 — substring match
279
+ numeric_match("Revenue was $4.52B", "4.5B")  # 1.0 — first number, within ±1% rel-tol (0.44% off)
280
+ numeric_match("Revenue was $4.52B", "4.5B",
281
+               rel_tolerance=0.001)  # 0.0 — tighter tol
282
+ ```
283
+
284
+ | Scorer | When to use it |
285
+ |---|---|
286
+ | `exact_match(p, e)` | LegalBench-style single-label classification (`yes` / `no` / `hold` / `overrule`). Whitespace- and case-insensitive. |
287
+ | `contains(p, e)` | The model is asked to answer in prose and the reference is a key fact/number/phrase that must appear somewhere in the answer. |
288
+ | `numeric_match(p, e, *, rel_tolerance=0.01)` | FinanceBench-style quantitative answers. Extracts the first number from each side (commas stripped), compares under relative tolerance. Defaults to ±1% per FinanceBench's grading convention. Returns 0.0 if either side has no parseable number — including refusals, so the refusal counter elsewhere doesn't need to gate this scorer. |
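The first-number extraction and relative-tolerance comparison described in the table can be sketched with the standard library. This is an illustrative re-implementation under stated assumptions, not the package's scorer: the regex, the choice to measure the error relative to the expected value, and the names `first_number` and `numeric_match_sketch` are all hypothetical.

```python
import re
from typing import Optional

_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def first_number(text: str) -> Optional[float]:
    """Return the first number in `text` after stripping commas, else None."""
    match = _NUMBER.search(text.replace(",", ""))
    return float(match.group()) if match else None


def numeric_match_sketch(predicted: str, expected: str, *,
                         rel_tolerance: float = 0.01) -> float:
    """1.0 if the first numbers agree within rel_tolerance, else 0.0."""
    p, e = first_number(predicted), first_number(expected)
    if p is None or e is None:
        return 0.0  # covers refusals and other number-free answers
    if e == 0:
        return 1.0 if p == 0 else 0.0
    # Assumption: error is measured relative to the expected value.
    return 1.0 if abs(p - e) / abs(e) <= rel_tolerance else 0.0
```

Because only the first number is read, a prediction such as `"The 2023 revenue was 4.5"` is compared on `2023` in this sketch, which is one reason `contains` can be the better pick for prose answers.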
289
+
221
290
  ## Samples
222
291
 
223
292
  - [`samples/bench-rag.py`](https://github.com/manavsehgal/ai-field-notes/blob/main/fieldkit/samples/bench-rag.py) — offline `Bench` + `Judge.parse` walkthrough.
@@ -62,6 +62,29 @@ chunks = chunk_text(long_doc, max_tokens=900)
62
62
 
63
63
  Polls `/models` until 200 or timeout. Returns `True` on success, `False` on timeout. Use it as the first call in any sample script that talks to a cold NIM.
64
64
 
65
+ ### `ChatMessage`
66
+
67
+ Type alias for the OpenAI-style chat message shape `NIMClient.chat()` consumes:
68
+
69
+ ```python
70
+ ChatMessage = dict[str, Any]
71
+ # Concretely: {"role": "system" | "user" | "assistant", "content": str | list[...]}
72
+ ```
73
+
74
+ Exported so callers can type-hint their own helpers that build message arrays without importing `Any` plumbing:
75
+
76
+ ```python
77
+ from fieldkit.nim import ChatMessage, NIMClient
78
+
79
+ def build_rag_prompt(question: str, chunks: list[str]) -> list[ChatMessage]:
80
+     return [
81
+         {"role": "system", "content": "Answer from the provided context only."},
82
+         {"role": "user", "content": "\n\n".join(chunks) + "\n\nQ: " + question},
83
+     ]
84
+ ```
85
+
86
+ The alias is intentionally permissive — content may be a string, a list of multimodal parts, or any provider-specific extension. Schema validation is left to the NIM server.
87
+
65
88
  ### Context-overflow preflight
66
89
 
67
90
  `NIMClient.chat()` runs a token-estimate check on its message list and raises `NIMContextOverflowError(estimated_tokens, ceiling)` **before any network call** when the request would exceed `NIM_CONTEXT_WINDOW = 8192`. The opaque NIM 400 from `project_spark_nim_context_window` never surfaces.
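The preflight idea, estimating tokens locally and failing before any network call, can be sketched as below. Only `NIM_CONTEXT_WINDOW = 8192` and the `(estimated_tokens, ceiling)` error payload come from the docs. The 4-characters-per-token heuristic and the names `ContextOverflowSketch`, `estimate_tokens`, and `preflight` are assumptions, and the real estimator may work differently.

```python
from typing import Any

NIM_CONTEXT_WINDOW = 8192  # value stated in the docs


class ContextOverflowSketch(ValueError):
    """Hypothetical stand-in for NIMContextOverflowError."""

    def __init__(self, estimated_tokens: int, ceiling: int) -> None:
        super().__init__(
            f"estimated {estimated_tokens} tokens exceeds ceiling {ceiling}"
        )
        self.estimated_tokens = estimated_tokens
        self.ceiling = ceiling


def estimate_tokens(messages: list[dict[str, Any]], chars_per_token: int = 4) -> int:
    """Crude estimate: characters of string content / chars_per_token, rounded up."""
    total_chars = sum(
        len(m["content"]) for m in messages if isinstance(m.get("content"), str)
    )
    return -(-total_chars // chars_per_token)  # ceiling division


def preflight(messages: list[dict[str, Any]], ceiling: int = NIM_CONTEXT_WINDOW) -> int:
    """Raise before any network call if the estimate exceeds the ceiling."""
    estimated = estimate_tokens(messages)
    if estimated > ceiling:
        raise ContextOverflowSketch(estimated, ceiling)
    return estimated
```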
@@ -0,0 +1,176 @@
1
+ ---
2
+ module: publish
3
+ title: fieldkit.publish
4
+ summary: HuggingFace push surface — `ModelCard` (frontmatter + body renderer), `ArtifactManifest` (Phase-2 sync record), `HFHubAdapter` (lazy huggingface_hub wrapper, dry-run by default), `publish_quant` orchestrator. Every Orionfold artifact card carries the same Spark-tested measurement quad (perplexity, tok/s, thermal envelope, optional vertical-eval) — this module is what makes that shape deterministic.
5
+ order: 8
6
+ ---
7
+
8
+ ## What it is
9
+
10
+ The publishing side of the Orionfold production line. `fieldkit.quant` produces a `QuantReport`; `fieldkit.publish` turns it into a HuggingFace repo with a deterministic model card and a per-artifact YAML manifest the source repo and destination site both read.
11
+
12
+ Three surfaces. `ModelCard` renders the canonical card shape — frontmatter (license, library_name, base_model, tags, model_creator), a `## Spark-tested` block (perplexity + tok/s + thermal envelope + optional vertical-eval table), a `## Variants` table, an auto-generated `## How to run` body (`huggingface-cli download` + `llama-server` + `llama-cpp-python` snippets templated from the HF repo path), an optional `## Lineage` block (rendered from a `fieldkit.lineage.LineageStore` if provided), a `## Methods` backlink to the anchor article, and an Orionfold LLC footer. `ArtifactManifest` is the frozen dataclass for `src/content/artifacts/<slug>.yaml` — the Phase-2 sync record per `project_artifact_manifests_phase2`; the destination renders catalog pages from `getCollection('artifacts')`. `HFHubAdapter` is a lazy wrapper around `huggingface_hub` — defaults to `dry_run=True` (stages files + logs the would-be calls; no network, no token); flip `dry_run=False` to push via `HfApi().upload_folder(...)`.
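The dry-run-by-default pattern with a lazy `huggingface_hub` import can be sketched as follows. `DryRunPusher` is a hypothetical stand-in, not `HFHubAdapter` itself: its fields, log format, and boolean return value are assumptions. Only the dry-run default and the `HfApi().upload_folder(...)` call on the live path come from the description above.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Optional


@dataclass
class DryRunPusher:
    """Hypothetical illustration of a dry-run-by-default uploader."""
    dry_run: bool = True
    token: Optional[str] = None
    log: list = field(default_factory=list)

    def push_folder(self, folder: Path, repo_id: str) -> bool:
        """Return True if a real upload happened, False for a dry run."""
        files = sorted(p.name for p in Path(folder).iterdir() if p.is_file())
        if self.dry_run:
            # No network and no token needed: just record the would-be call.
            self.log.append(
                f"would upload {len(files)} file(s) to {repo_id}: {files}"
            )
            return False
        # Lazy import: huggingface_hub is only required on the live path.
        from huggingface_hub import HfApi

        HfApi(token=self.token).upload_folder(
            folder_path=str(folder), repo_id=repo_id
        )
        return True
```

Keeping the import inside the live branch is what lets the dry-run path run on a machine without `huggingface_hub` installed.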
13
+
14
+ The module exists because manual card authoring at MTBM's 3–5-day cadence is the bottleneck. Every quant needs a tags list, a perplexity table, a tok/s number, a thermal envelope note, a lineage backlink — and getting any of those wrong on the customer-facing HF page is a trust hit. `fieldkit.publish` makes the card the deterministic output of the quant+lineage run, not a hand-edit, so the only knobs the operator sets are the ones that genuinely require human judgement (the upstream license, the chat format, the featured variant).
15
+
16
+ ## Public API
17
+
18
+ ```python
19
+ from fieldkit.publish import (
20
+     ARTIFACT_KINDS, ArtifactKind, ArtifactManifest,
21
+     HFHubAdapter, HFHubNotAvailable, HFAuthError,
22
+     ModelCard, PublishError, PublishResult,
23
+     publish_quant, write_artifact_manifest,
24
+     ORIONFOLD_BRAND, ORIONFOLD_HF_HANDLE, ORIONFOLD_HF_ORG,
25
+ )
26
+ ```
27
+
28
+ ### `ORIONFOLD_BRAND` + `ORIONFOLD_HF_HANDLE`
29
+
30
+ ```python
31
+ ORIONFOLD_BRAND = "Orionfold LLC"
32
+ ORIONFOLD_HF_HANDLE = "Orionfold"
33
+ ```
34
+
35
+ The brand stamped on every card footer, and the HuggingFace user handle every repo lands under (`Orionfold/<model>-GGUF`, Bartowski-shape). `ORIONFOLD_HF_ORG` is a back-compat alias for `ORIONFOLD_HF_HANDLE`, kept importable for out-of-tree code and slated for removal in a future cut.
36
+
37
+ ### `ARTIFACT_KINDS`
38
+
39
+ ```python
40
+ ARTIFACT_KINDS = (
41
+ "quant", "lora", "adapter", "embed",
42
+ "reranker", "dataset", "space", "bench",
43
+ )
44
+ ```
45
+
46
+ The manifest `kind` enum. Mirrored by `src/content.config.ts`'s `ARTIFACT_KINDS` so Astro Zod validation and the Python writer stay in lockstep.
47
+
48
+ ### `ModelCard(...)`
49
+
50
+ Frozen dataclass + `render() → str`. Constructed by `publish_quant` from a `QuantReport`-shaped object plus the resolved license / chat_format / recommended_variant triple. Renders to a single `README.md`-style string.
51
+
52
+ Key fields:
53
+
54
+ ```python
55
+ ModelCard(
56
+ title="finance chat GGUF",
57
+ one_liner="...",
58
+ base_model="AdaptLLM/finance-chat",
59
+ license="llama2", # ← HF frontmatter scalar; reflects upstream model's license
60
+ library_name="gguf",
61
+ pipeline_tag="text-generation",
62
+ tags=("gguf", "spark-tested", "orionfold", "base_model:AdaptLLM/finance-chat"),
63
+ quant_format="gguf",
64
+ variants=({"name": "Q4_K_M", "size": "3.8 GB", "recommended": "..."}, ...),
65
+ perplexity={"Q4_K_M": 6.221, "Q8_0": 6.137, ...},
66
+ tokens_per_sec={"Q4_K_M": 31.1, "Q8_0": 8.9, ...},
67
+ sustained_load_minutes=2.18,
68
+ vertical_eval={"Q4_K_M": 0.14, ...}, # optional 5th column
69
+ vertical_eval_name="FinanceBench (n=50, numeric_match)",
70
+ hf_repo="Orionfold/finance-chat-GGUF", # drives default `## How to run` body
71
+ chat_format="llama-2", # → llama_cpp.Llama(chat_format=...)
72
+ recommended_variant="Q5_K_M", # featured in default snippets
73
+ ollama_pull_handle=None, # opt-in override; default body wins otherwise
74
+ transformers_snippet=None,
75
+ lineage_prompt=None, # injected by publish_quant if a LineageStore is supplied
76
+ article_slug="becoming-a-gguf-publisher-on-spark",
77
+ article_title="...",
78
+ model_creator=ORIONFOLD_BRAND,
79
+ )
80
+ ```
81
+
82
+ `render()` emits sections in canonical order: YAML frontmatter → title + elevator → `## Spark-tested` (omitted if no measurements) → `## Variants` → `## How to run` (auto-rendered defaults when no explicit handle/snippet given; entirely omitted if no defaults templatable) → `## Lineage` (if `lineage_prompt` supplied) → `## Methods` link → footer.
83
+
84
+ ### `ArtifactManifest(...)`
85
+
86
+ Frozen dataclass for `src/content/artifacts/<slug>.yaml`. Flat-by-design (primitive types + dicts of primitives) so the YAML emitter is hand-rolled stdlib.
87
+
88
+ ```python
89
+ m = ArtifactManifest(
90
+ slug="finance-chat-gguf",
91
+ kind="quant",
92
+ artifact_class="gguf", # serialized as `class:` in YAML
93
+ base_model="AdaptLLM/finance-chat",
94
+ hf_repo="Orionfold/finance-chat-GGUF",
95
+ variants=("Q4_K_M", "Q5_K_M", "Q6_K", "Q8_0", "F16"),
96
+ perplexity={"Q4_K_M": 6.221, ...},
97
+ spark_tokens_per_sec={"Q4_K_M": 31.09, ...},
98
+ sustained_load_minutes=2.18,
99
+ vertical_eval={"Q4_K_M": 0.14, ...},
100
+ vertical_eval_name="FinanceBench (n=50, numeric_match)",
101
+ lineage_run_id=None,
102
+ license_tier="free", # Orionfold commercial tier (free / pro)
103
+ license_commercial_tier=None,
104
+ model_license="llama2", # upstream model license (HF frontmatter shape)
105
+ article="articles/becoming-a-gguf-publisher-on-spark/",
106
+ civitai_id=None,
107
+ download_count=None,
108
+ published_at="2026-05-14T04:46:11Z",
109
+ )
110
+ print(m.to_yaml())
111
+ ```
112
+
113
+ The `license_tier` / `license_commercial_tier` fields live alongside `model_license` under a nested `license:` block in YAML output. The Mac destination's Zod schema mirrors this shape.
114
+
115
+ ### `write_artifact_manifest(manifest, *, artifacts_dir)`
116
+
117
+ Writes the manifest to `<artifacts_dir>/<slug>.yaml`. Creates the directory if missing. Returns the absolute path of the written file — callers can stage it alongside the article for the next git commit.
118
+
119
+ ### `HFHubAdapter(staging_dir, *, dry_run=True, token=None, org=ORIONFOLD_HF_HANDLE)`
120
+
121
+ Thin wrapper around `huggingface_hub`. Dry-run by default: lays out the upload set on disk under `staging_dir`, logs the would-be calls. No HF imports required, no token required. Flip `dry_run=False` to push; the lazy import of `huggingface_hub` fires only then.
122
+
123
+ ```python
124
+ adapter = HFHubAdapter(staging_dir="/tmp/orionfold-stage/finance-chat", dry_run=True)
125
+ adapter.stage_text(card.render(), "README.md") # stages from a string
126
+ adapter.stage_file(gguf_path, "model-Q4_K_M.gguf") # stages by copying a file
127
+ result = adapter.push_folder(repo_name="finance-chat-GGUF")
128
+ result.dry_run # True
129
+ result.files_uploaded # ('README.md', 'model-Q4_K_M.gguf', ...)
130
+ result.logged_calls # the upload_folder kwargs that would have fired
131
+ ```
132
+
133
+ Token resolution order: explicit `token=` arg → `HF_TOKEN` env → `HUGGING_FACE_HUB_TOKEN` env → `huggingface_hub`'s cached login. If all four are absent and `dry_run=False`, `HFAuthError` raises before the network call.
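The resolution order above can be sketched as a small lookup chain. This is an illustrative sketch only, not the `fieldkit.publish` implementation: `resolve_token`, its `env` and `cached_lookup` parameters, and the local `HFAuthError` stand-in are hypothetical names introduced here so the block runs without `huggingface_hub` installed.

```python
import os
from typing import Callable, Mapping, Optional


class HFAuthError(RuntimeError):
    """Local stand-in for fieldkit.publish.HFAuthError (illustrative only)."""


def resolve_token(
    explicit: Optional[str] = None,
    *,
    env: Optional[Mapping[str, str]] = None,
    cached_lookup: Optional[Callable[[], Optional[str]]] = None,
    dry_run: bool = True,
) -> Optional[str]:
    """Return the first token found, checking sources in the documented order."""
    env = os.environ if env is None else env
    # Each source is a zero-argument callable so later ones are only
    # consulted when the earlier ones come back empty.
    sources = (
        lambda: explicit,
        lambda: env.get("HF_TOKEN"),
        lambda: env.get("HUGGING_FACE_HUB_TOKEN"),
        cached_lookup or (lambda: None),  # assumed hook for the cached login
    )
    for source in sources:
        token = source()
        if token:
            return token
    if dry_run:
        return None  # a dry run stages files only, so no token is needed
    # Fail before any network call is attempted.
    raise HFAuthError("no HuggingFace token found; set HF_TOKEN or log in")
```

Passing the environment and the cached-login lookup as arguments keeps the chain testable without touching real credentials.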
134
+
135
+ ### `publish_quant(*, quant_report, base_model, repo_name, staging_dir, ...) → PublishResult`
136
+
137
+ The one-line orchestrator. Reads the duck-typed `quant_report` fields (`.format`, `.variants`, `.perplexity`, `.tokens_per_sec`, `.sustained_load_minutes`, `.variant_files`, `.vertical_eval`, `.vertical_eval_name`, `.model_license`, `.chat_format`, `.recommended_variant`), builds a `ModelCard`, stages the README + variant files, writes the `ArtifactManifest` (if `artifacts_dir` supplied), and invokes `HFHubAdapter.push_folder()`. Explicit kwargs override duck-typed report attrs.
138
+
139
+ ```python
140
+ result = publish_quant(
141
+ quant_report=report,
142
+ base_model="AdaptLLM/finance-chat",
143
+ repo_name="finance-chat-GGUF",
144
+ staging_dir="/tmp/orionfold-stage/finance-chat",
145
+ artifacts_dir="/home/nvidia/ai-field-notes/src/content/artifacts",
146
+ article_slug="becoming-a-gguf-publisher-on-spark",
147
+ article_title="...",
148
+ vertical_eval={"Q4_K_M": 0.14, "Q5_K_M": 0.16, ...},
149
+ vertical_eval_name="FinanceBench (n=50, numeric_match)",
150
+ model_license="llama2", # critical — never default silently to apache-2.0
151
+ chat_format="llama-2",
152
+ recommended_variant="Q5_K_M",
153
+ lineage_store=store, # optional; injects ## Lineage block
154
+ dry_run=True, # flip to False for the actual push
155
+ )
156
+ result.hf_repo # 'Orionfold/finance-chat-GGUF'
157
+ result.card_path # Path('/tmp/orionfold-stage/.../README.md')
158
+ result.manifest_path # Path('.../src/content/artifacts/finance-chat-gguf.yaml')
159
+ result.hf_url # None in dry-run; set after live push
160
+ ```
161
+
162
+ The `model_license` / `chat_format` / `recommended_variant` kwargs landed in v0.4.x after the `Orionfold/finance-chat-GGUF` dry-run surfaced two card-rendering bugs: a hardcoded `license: apache-2.0` (wrong for the Llama-2 lineage AdaptLLM base) and an empty `## How to run` section (when no ollama handle or transformers snippet was supplied, the section header rendered with no body). Both are now caller-controlled with sane defaults.
163
+
164
+ ## Why this surface
165
+
166
+ Three things to notice. First, `HFHubAdapter` defaults to dry-run because the right workflow is dry-run → human review → live push. Library users who want a one-shot live push pass `dry_run=False` explicitly; library users who want the staging artifact for review (the common case during development) get it for free. The `hf-publisher` skill (`/home/nvidia/.claude/skills/hf-publisher/`) wraps this workflow as a triggered Claude Code surface.
167
+
168
+ Second, `publish_quant` duck-types its report rather than importing `fieldkit.quant.QuantReport` directly. Neither module imports the other (quant doesn't depend on publish; publish doesn't depend on quant), so a circular import can never arise, and non-quant callers (a LoRA pipeline, an embedding pipeline) can supply their own report-shaped objects without subclassing.
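A minimal sketch of that duck-typing pattern, assuming the report is any object with the documented attribute names. `resolve_field` is a hypothetical helper written for this example, and falling back to the first variant for `recommended_variant` is an assumption, not confirmed library behavior.

```python
from types import SimpleNamespace
from typing import Any


def resolve_field(report: Any, name: str, override: Any = None, default: Any = None) -> Any:
    """Explicit kwarg wins; otherwise read the attribute off the report; else default."""
    if override is not None:
        return override
    value = getattr(report, name, None)  # a missing attribute is treated like None
    return default if value is None else value


# A report-shaped object; no QuantReport import is needed.
report = SimpleNamespace(
    format="gguf",
    variants=("Q4_K_M", "Q8_0"),
    perplexity={"Q4_K_M": 6.221, "Q8_0": 6.137},
    chat_format="llama-2",
)

chat_format = resolve_field(report, "chat_format")                        # read from the report
model_license = resolve_field(report, "model_license", override="llama2")  # explicit kwarg wins
recommended = resolve_field(                                              # assumed fallback
    report, "recommended_variant", default=report.variants[0]
)
```

Because only attribute names matter, a LoRA or embedding pipeline could hand over its own namespace or dataclass with the same fields.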
169
+
170
+ Third, `ArtifactManifest` is structurally distinct from `ModelCard` even though they overlap. The card is for HuggingFace; the manifest is for the destination Astro catalog. Both encode the same artifact, but the *consumers* are different and have different schemas. Keeping them separate dataclasses lets each evolve independently — and lets `write_artifact_manifest` write the manifest even when the HF push is dry-run, which is what the source repo commits look like during article-only iterations.
171
+
172
+ ## Samples
173
+
174
+ - [`scripts/g3_build_first_quant.sh`](https://github.com/manavsehgal/ai-field-notes/blob/main/scripts/g3_build_first_quant.sh) — `publish-dryrun` step assembles a `QuantReport`-shaped `SimpleNamespace` from the measurement JSON and calls `publish_quant(..., dry_run=True)`.
175
+ - [`scripts/g3_push_first_quant.py`](https://github.com/manavsehgal/ai-field-notes/blob/main/scripts/g3_push_first_quant.py) — the live-push one-shot. Reuses the existing dry-run stage; calls `HFHubAdapter(staging_dir=..., dry_run=False).push_folder()` directly so the 32 GB of GGUF bytes don't get re-staged.
176
+ - [`articles/becoming-a-gguf-publisher-on-spark/`](https://ainative.business/field-notes/becoming-a-gguf-publisher-on-spark/) — anchor article. Walks the v0.4.x publish surface end-to-end against `Orionfold/finance-chat-GGUF` and narrates the two bugs that v0.4.0 fixed before tagging.
@@ -0,0 +1,138 @@
1
+ ---
2
+ module: quant
3
+ title: fieldkit.quant
4
+ summary: GGUF quantize + measure pipeline — wraps llama.cpp's `convert_hf_to_gguf.py` + `llama-quantize` + `llama-perplexity` + `llama-bench`, plus a pure-stdlib `nvidia-smi` thermal probe. Emits the `QuantReport` shape `fieldkit.publish.publish_quant` consumes. Non-GGUF formats (AWQ / GPTQ / EXL3 / MLX / NVFP4) are named stubs reserving the v0.5 API surface.
5
+ order: 7
6
+ ---
7
+
8
+ ## What it is
9
+
10
+ The Spark-side production line for Orionfold GGUF cards. One module-level call (`quantize_gguf`) produces every variant — `Q4_K_M`, `Q5_K_M`, `Q6_K`, `Q8_0`, `F16` — from a HuggingFace Transformers checkpoint, using the locally-built llama.cpp binaries. Two measurement helpers (`measure_perplexity_gguf`, `measure_tokens_per_sec_gguf`) and a `ThermalProbe` collect the three numbers every Orionfold quant card carries: perplexity (vs wikitext-2), sustained `tok/s` (via `llama-bench`), and minutes-before-thermal-throttle on the GB10's GPU.
11
+
12
+ The shape exists because the v0.4 quant pipeline used to be three shell scripts that disagreed about argument names and wrote three different report formats. `fieldkit.quant` collapses them behind one `QuantReport` dataclass — the contract `fieldkit.publish.publish_quant` reads. Quantize once, measure four axes, hand the report to publish, get a model card.
13
+
14
+ Non-GGUF formats are reserved as named stubs. `quantize_awq()`, `quantize_gptq()`, `quantize_exl3()`, `quantize_mlx()`, `quantize_nvfp4()` each raise `NotImplementedError` with a one-line pointer at `ideas/mtbm-use-cases.md` §7. The stubs lock the v0.4 public surface so v0.5+ implementations slot in without an API break — callers can write code against `quantize_<format>(...)` today and pick which formats actually run later.
15
+
16
+ ## Public API
17
+
18
+ ```python
19
+ from fieldkit.quant import (
20
+ GGUFVariant, GGUF_VARIANTS, QuantFormat,
21
+ QuantReport, QuantError, LlamaCppNotFound,
22
+ LlamaCppPaths, ThermalProbe, ThermalReading,
23
+ quantize_gguf,
24
+ quantize_awq, quantize_gptq, quantize_exl3, quantize_mlx, quantize_nvfp4,
25
+ measure_perplexity_gguf,
26
+ measure_tokens_per_sec_gguf,
27
+ parse_perplexity_output,
28
+ parse_llama_bench_output,
29
+ )
30
+ ```
31
+
32
+ ### `GGUF_VARIANTS`
33
+
34
+ ```python
35
+ GGUF_VARIANTS = ("Q4_K_M", "Q5_K_M", "Q6_K", "Q8_0", "F16")
36
+ ```
37
+
38
+ The canonical Orionfold variant set (Bartowski-comparable). Order matters — perplexity tables in model cards walk this list left to right. `GGUFVariant` is type-aliased to `str` so experimental additions (`IQ4_XS`, etc.) don't require an enum bump.
39
+
40
+ ### `LlamaCppPaths`
41
+
42
+ Locator dataclass for the four llama.cpp executables: `llama-quantize`, `llama-perplexity`, `llama-bench`, and `convert_hf_to_gguf.py`. `resolve()` fills any unset field from env (`LLAMA_CPP_BIN`, `LLAMA_CPP_CONVERT`) and `which` lookups; `require(attr)` returns the path or raises `LlamaCppNotFound` with a clear remediation message.
43
+
44
+ ```python
45
+ paths = LlamaCppPaths().resolve() # populate from env + PATH
46
+ paths.require("quantize") # → Path('/home/nvidia/llama.cpp/build/bin/llama-quantize')
47
+ ```
48
+
49
+ ### `quantize_gguf(...)`
50
+
51
+ ```python
52
+ report = quantize_gguf(
53
+ model="AdaptLLM/finance-chat", # HF repo id OR local Transformers checkpoint dir
54
+ outdir="/home/nvidia/data/quants/finance-chat",
55
+ variants=("Q4_K_M", "Q5_K_M", "Q6_K", "Q8_0", "F16"),
56
+ paths=LlamaCppPaths().resolve(),
57
+ base_model_id="AdaptLLM/finance-chat", # threaded into the QuantReport
58
+ dry_run=False, # True enumerates the would-be subprocess commands
59
+ )
60
+ print(report.variant_files["Q4_K_M"])
61
+ # {'path': '/home/nvidia/data/quants/finance-chat/model-Q4_K_M.gguf', 'rel': 'model-Q4_K_M.gguf', 'size': '3.8 GB'}
62
+ ```
63
+
64
+ If the source isn't already a GGUF, `quantize_gguf` first invokes `convert_hf_to_gguf.py --outtype f16` to produce a base F16 file, then runs `llama-quantize` per variant against that intermediate. The intermediate is reused as the F16 variant of the final report — no double-conversion. `dry_run=True` enumerates the subprocess commands into `report.notes` without running them; this is the path tests + CI use to verify the orchestration without needing an 8 GB checkpoint on hand.
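The convert-then-quantize sequence can be sketched as a pure command planner, similar in spirit to what a dry run would enumerate. This is a hedged sketch: `plan_gguf_commands` is a hypothetical function, the `model-F16.gguf` intermediate name is an assumption extrapolated from the `model-Q4_K_M.gguf` naming shown above, and the actual commands `quantize_gguf` records in `report.notes` may differ.

```python
from pathlib import Path
from typing import List, Sequence


def plan_gguf_commands(
    model_dir: str,
    outdir: str,
    variants: Sequence[str],
    *,
    convert_script: str = "convert_hf_to_gguf.py",
    quantize_bin: str = "llama-quantize",
) -> List[List[str]]:
    """Enumerate the subprocess argv lists without running anything."""
    out = Path(outdir)
    f16 = out / "model-F16.gguf"  # assumed name for the F16 intermediate
    commands = [[
        "python", convert_script, model_dir,
        "--outfile", str(f16), "--outtype", "f16",
    ]]
    for variant in variants:
        if variant == "F16":
            continue  # the intermediate doubles as the F16 variant
        target = out / f"model-{variant}.gguf"
        # llama-quantize takes: <input.gguf> <output.gguf> <type>
        commands.append([quantize_bin, str(f16), str(target), variant])
    return commands


for argv in plan_gguf_commands("ckpt", "quants", ("Q4_K_M", "Q8_0", "F16")):
    print(" ".join(argv))
```

Keeping the planner free of side effects is what lets tests check the orchestration without a checkpoint or llama.cpp binaries on hand.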
65
+
66
+ ### `measure_perplexity_gguf(gguf, *, corpus, paths, n_ctx=512)`
67
+
68
+ Wraps `llama-perplexity`. Returns a `float` parsed from the canonical `Final estimate: PPL = N.NNN` line, or `None` on parse failure. Cards that ship without a perplexity column use the `None` path — the rendering is forgiving (the column shows `—`).
69
+
70
+ ```python
71
+ ppl = measure_perplexity_gguf(
72
+ "/home/nvidia/data/quants/finance-chat/model-Q4_K_M.gguf",
73
+ corpus="/home/nvidia/data/calibration/wikitext-2-raw-v1/wiki.test.raw",
74
+ paths=paths,
75
+ ) # → 6.2215
76
+ ```
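For readers who want to see the parse step in isolation, here is a minimal sketch of extracting the value from the `Final estimate: PPL = N.NNN` line. The `parse_ppl` name and the regex are written for this example; they are not the library's `parse_perplexity_output` and may be stricter or looser than it.

```python
import re
from typing import Optional

# Matches the summary line llama-perplexity prints at the end of a run.
_PPL_RE = re.compile(r"Final estimate:\s*PPL\s*=\s*([0-9]+(?:\.[0-9]+)?)")


def parse_ppl(text: str) -> Optional[float]:
    """Return the final perplexity estimate, or None if the line is absent."""
    match = _PPL_RE.search(text)
    return float(match.group(1)) if match else None


log = "perplexity: calculating...\nFinal estimate: PPL = 6.2215 +/- 0.04123\n"
print(parse_ppl(log))             # 6.2215
print(parse_ppl("run aborted"))   # None
```

Returning `None` instead of raising matches the forgiving behavior described above, where a missing number renders as an empty cell.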
77
+
78
+ ### `measure_tokens_per_sec_gguf(gguf, *, paths, metric='tg', n_gpu_layers=99)`
79
+
80
+ Wraps `llama-bench`. `metric='tg'` returns text-generation `tok/s`; `metric='pp'` returns prompt-processing `tok/s`. Returns `None` on parse failure.
81
+
82
+ ```python
83
+ tg = measure_tokens_per_sec_gguf(gguf, paths=paths, metric='tg') # → 31.1
84
+ pp = measure_tokens_per_sec_gguf(gguf, paths=paths, metric='pp') # → 1111.1
85
+ ```
86
+
87
+ ### `ThermalProbe(interval_s=2.0, throttle_temp_c=83.0)`
88
+
89
+ Pure-stdlib `nvidia-smi` poll loop. Spin one in a background thread for the duration of a measurement run; after `stop()` it reports sustained-load minutes (the wall-clock time before the first sample crossed `throttle_temp_c` or hit a `clocks_throttle_reasons.hw_thermal_slowdown` flag). Per the 2026-05-12 HANDOFF Q9 decision, every Orionfold card publishes this number.
90
+
91
+ ```python
92
+ probe = ThermalProbe()
93
+ probe.start()
94
+ # ... run a long bench / inference burst
95
+ probe.stop()
96
+ print(probe.sustained_load_minutes) # → 2.18
97
+ ```
98
+
99
+ `ThermalReading` is the per-sample frozen dataclass — useful when you want the full timeseries for a per-variant chart instead of just the sustained-load floor.
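The sustained-load definition can be illustrated over a recorded timeseries. This is a sketch under assumptions: `Sample` and its field names are hypothetical stand-ins (the real `ThermalReading` fields are not shown here), "crossed" is read as `>=`, and a run that never throttles is assumed to report its full duration.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Sample:
    """Hypothetical per-poll record; not the library's ThermalReading."""
    elapsed_s: float
    temp_c: float
    hw_thermal_slowdown: bool = False


def sustained_load_minutes(samples: Iterable[Sample], throttle_temp_c: float = 83.0) -> float:
    """Minutes elapsed before the first throttling sample."""
    last_elapsed = 0.0
    for s in samples:
        if s.temp_c >= throttle_temp_c or s.hw_thermal_slowdown:
            return s.elapsed_s / 60.0
        last_elapsed = s.elapsed_s
    # Assumption: with no throttle event, report the whole observed window.
    return last_elapsed / 60.0


run = [Sample(0, 60.0), Sample(60, 75.0), Sample(120, 84.0), Sample(180, 70.0)]
print(sustained_load_minutes(run))  # 2.0
```

Later samples are ignored once the first threshold crossing is found, which is why the cooler reading at 180 s does not change the result.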
100
+
101
+ ### `QuantReport`
102
+
103
+ The canonical output. `format` discriminates across formats; GGUF callers populate `variant_files` (path + rel + human-size per variant), `perplexity`, and `tokens_per_sec` dicts keyed by variant name; AWQ / GPTQ callers will populate a single-file shape when those backends land. `notes` is a free-text scratchpad — `dry_run` paths use it for the would-be commands; production runs use it for one-off observations the article will quote.
104
+
105
+ ```python
106
+ report.format # 'gguf'
107
+ report.variants # ('Q4_K_M', 'Q5_K_M', 'Q6_K', 'Q8_0', 'F16')
108
+ report.perplexity['Q8_0'] # 6.137
109
+ report.tokens_per_sec['Q4_K_M'] # 31.1
110
+ report.sustained_load_minutes # 2.18
111
+ ```
112
+
113
+ ### `parse_perplexity_output(text)` + `parse_llama_bench_output(text, metric='tg')`
114
+
115
+ The two parsing primitives, exposed in case you have llama.cpp output already in hand (e.g., from a logged run). Both return `Optional[float]`.
116
+
117
+ ### Non-GGUF stubs
118
+
119
+ ```python
120
+ quantize_awq(...) # NotImplementedError — see ideas/mtbm-use-cases.md §7 (v0.5 cut)
121
+ quantize_gptq(...)
122
+ quantize_exl3(...)
123
+ quantize_mlx(...)
124
+ quantize_nvfp4(...)
125
+ ```
126
+
127
+ Five named entry points reserving the v0.5 surface. Each raises `NotImplementedError` with a one-liner roadmap pointer. Callers writing forward-looking pipelines can shape their code today against `quantize_<format>(...)` and pick the format at runtime — the v0.5 cut wires the implementations behind the same signatures.
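A caller-side dispatcher that picks the format at runtime might look like the sketch below. Everything here is illustrative: `quantize_dispatch`, `BACKENDS`, `_stub`, and `fake_quantize_gguf` are hypothetical local names standing in for the real `fieldkit.quant` functions so the block runs on its own.

```python
from typing import Any, Callable, Dict


def _stub(name: str) -> Callable[..., Any]:
    """Build a placeholder that behaves like the reserved non-GGUF entry points."""
    def quantize(*args: Any, **kwargs: Any) -> Any:
        raise NotImplementedError(f"{name} is reserved for a later release")
    return quantize


def fake_quantize_gguf(model: str, **kwargs: Any) -> str:
    """Stand-in for the one backend that is implemented today."""
    return f"gguf:{model}"


# In real code these values would be the functions imported from fieldkit.quant.
BACKENDS: Dict[str, Callable[..., Any]] = {
    "gguf": fake_quantize_gguf,
    "awq": _stub("quantize_awq"),
    "gptq": _stub("quantize_gptq"),
    "exl3": _stub("quantize_exl3"),
    "mlx": _stub("quantize_mlx"),
    "nvfp4": _stub("quantize_nvfp4"),
}


def quantize_dispatch(fmt: str, model: str, **kwargs: Any) -> Any:
    """Route to the backend for `fmt`; unknown names fail fast with ValueError."""
    try:
        backend = BACKENDS[fmt]
    except KeyError:
        raise ValueError(f"unknown quant format: {fmt!r}") from None
    return backend(model, **kwargs)


print(quantize_dispatch("gguf", "AdaptLLM/finance-chat"))
```

Distinguishing `ValueError` (unknown format) from `NotImplementedError` (reserved format) lets a pipeline tell a typo apart from a backend that has not landed yet.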
128
+
129
+ ## Why this surface
130
+
131
+ Three things to notice. First, every public function takes `paths=LlamaCppPaths()` as an explicit kwarg rather than reading env vars internally; this makes test runs (which pass mock paths) and production runs (which pass `LlamaCppPaths().resolve()`) the same code path. Second, the four measurement axes (perplexity, tg tok/s, pp tok/s, thermal) are *separate* helpers rather than a monolithic `measure_all`. Run only the ones you care about, in any order, with whatever subset of variants makes sense, and let the orchestration script (`scripts/g3_build_first_quant.sh measure` is the canonical one) decide the wall-time budget. Third, the non-GGUF stubs aren't error-stubs in disguise; they're a public API contract. v0.5 will fill them in; today's callers can already write their own `quantize_dispatch(format, ...)` wrapper against the full set.
132
+
133
+ The module sits next to `fieldkit.publish` because the two are tightly coupled by contract: `publish_quant` consumes the `QuantReport` shape, and the variant-file paths it reads come straight from `report.variant_files[v]['path']`. Keeping them as separate modules avoids a circular import (publish doesn't import quant; it duck-types the report) while keeping the production line one `from fieldkit.quant import ...` plus one `from fieldkit.publish import ...` away.
134
+
135
+ ## Samples
136
+
137
+ - [`scripts/g3_build_first_quant.sh`](https://github.com/manavsehgal/ai-field-notes/blob/main/scripts/g3_build_first_quant.sh): the canonical end-to-end runner. `quantize` step calls `quantize_gguf`; `measure` step runs the three measurements (perplexity, `tg` tok/s, `pp` tok/s) per variant plus a `ThermalProbe`; `publish-dryrun` step assembles the `QuantReport` shape and hands it to `fieldkit.publish.publish_quant(..., dry_run=True)`.
138
+ - [`articles/becoming-a-gguf-publisher-on-spark/`](https://ainative.business/field-notes/becoming-a-gguf-publisher-on-spark/) — anchor article. Walks the five-variant production line for `Orionfold/finance-chat-GGUF`, the four measurement axes, the open-book FinanceBench overlay, and the chat-vs-base-model trap that gates V1 picks.