fieldkit 0.3.0__tar.gz → 0.4.1__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (48)
  1. {fieldkit-0.3.0 → fieldkit-0.4.1}/CHANGELOG.md +99 -0
  2. {fieldkit-0.3.0 → fieldkit-0.4.1}/PKG-INFO +1 -1
  3. {fieldkit-0.3.0 → fieldkit-0.4.1}/docs/api/capabilities.md +49 -0
  4. {fieldkit-0.3.0 → fieldkit-0.4.1}/docs/api/cli.md +2 -2
  5. {fieldkit-0.3.0 → fieldkit-0.4.1}/docs/api/eval.md +69 -0
  6. {fieldkit-0.3.0 → fieldkit-0.4.1}/docs/api/nim.md +23 -0
  7. fieldkit-0.4.1/docs/api/publish.md +176 -0
  8. fieldkit-0.4.1/docs/api/quant.md +138 -0
  9. {fieldkit-0.3.0 → fieldkit-0.4.1}/docs/api/rag.md +17 -0
  10. {fieldkit-0.3.0 → fieldkit-0.4.1}/src/fieldkit/_version.py +1 -1
  11. {fieldkit-0.3.0 → fieldkit-0.4.1}/src/fieldkit/eval/__init__.py +17 -0
  12. fieldkit-0.4.1/src/fieldkit/eval/vertical.py +428 -0
  13. fieldkit-0.4.1/src/fieldkit/publish/__init__.py +982 -0
  14. fieldkit-0.4.1/src/fieldkit/quant/__init__.py +568 -0
  15. fieldkit-0.4.1/tests/test_publish.py +807 -0
  16. fieldkit-0.4.1/tests/test_quant.py +314 -0
  17. fieldkit-0.4.1/tests/test_vertical_bench.py +469 -0
  18. {fieldkit-0.3.0 → fieldkit-0.4.1}/.gitignore +0 -0
  19. {fieldkit-0.3.0 → fieldkit-0.4.1}/LICENSE +0 -0
  20. {fieldkit-0.3.0 → fieldkit-0.4.1}/README.md +0 -0
  21. {fieldkit-0.3.0 → fieldkit-0.4.1}/docs/api/lineage.md +0 -0
  22. {fieldkit-0.3.0 → fieldkit-0.4.1}/docs/api/training.md +0 -0
  23. {fieldkit-0.3.0 → fieldkit-0.4.1}/pyproject.toml +0 -0
  24. {fieldkit-0.3.0 → fieldkit-0.4.1}/samples/bench-rag.py +0 -0
  25. {fieldkit-0.3.0 → fieldkit-0.4.1}/samples/feasibility-math.py +0 -0
  26. {fieldkit-0.3.0 → fieldkit-0.4.1}/samples/hello-lineage.py +0 -0
  27. {fieldkit-0.3.0 → fieldkit-0.4.1}/samples/hello-nim.py +0 -0
  28. {fieldkit-0.3.0 → fieldkit-0.4.1}/samples/naive-rag.py +0 -0
  29. {fieldkit-0.3.0 → fieldkit-0.4.1}/src/fieldkit/__init__.py +0 -0
  30. {fieldkit-0.3.0 → fieldkit-0.4.1}/src/fieldkit/capabilities/__init__.py +0 -0
  31. {fieldkit-0.3.0 → fieldkit-0.4.1}/src/fieldkit/capabilities/data/__init__.py +0 -0
  32. {fieldkit-0.3.0 → fieldkit-0.4.1}/src/fieldkit/capabilities/data/spark-capabilities.json +0 -0
  33. {fieldkit-0.3.0 → fieldkit-0.4.1}/src/fieldkit/cli/__init__.py +0 -0
  34. {fieldkit-0.3.0 → fieldkit-0.4.1}/src/fieldkit/lineage/__init__.py +0 -0
  35. {fieldkit-0.3.0 → fieldkit-0.4.1}/src/fieldkit/nim/__init__.py +0 -0
  36. {fieldkit-0.3.0 → fieldkit-0.4.1}/src/fieldkit/rag/__init__.py +0 -0
  37. {fieldkit-0.3.0 → fieldkit-0.4.1}/src/fieldkit/training/__init__.py +0 -0
  38. {fieldkit-0.3.0 → fieldkit-0.4.1}/tests/__init__.py +0 -0
  39. {fieldkit-0.3.0 → fieldkit-0.4.1}/tests/conftest.py +0 -0
  40. {fieldkit-0.3.0 → fieldkit-0.4.1}/tests/test_capabilities.py +0 -0
  41. {fieldkit-0.3.0 → fieldkit-0.4.1}/tests/test_cli.py +0 -0
  42. {fieldkit-0.3.0 → fieldkit-0.4.1}/tests/test_eval.py +0 -0
  43. {fieldkit-0.3.0 → fieldkit-0.4.1}/tests/test_lineage.py +0 -0
  44. {fieldkit-0.3.0 → fieldkit-0.4.1}/tests/test_nim.py +0 -0
  45. {fieldkit-0.3.0 → fieldkit-0.4.1}/tests/test_nim_spark.py +0 -0
  46. {fieldkit-0.3.0 → fieldkit-0.4.1}/tests/test_rag.py +0 -0
  47. {fieldkit-0.3.0 → fieldkit-0.4.1}/tests/test_rag_spark.py +0 -0
  48. {fieldkit-0.3.0 → fieldkit-0.4.1}/tests/test_training.py +0 -0
@@ -6,6 +6,105 @@ The format follows [Keep a Changelog](https://keepachangelog.com/en/1.1.0/), and
6
6
 
7
7
  ## [Unreleased]
8
8
 
9
+ ## [0.4.1] — 2026-05-14
10
+
11
+ Patch release. The `fieldkit.eval.VerticalBench` overlay introduced in v0.4.0 needed two kwargs to score FinanceBench correctly (open-book context-prepend) and to bound a JSONL slice (subset filter on `question_type`). Both lifts came out of the 2026-05-13 V1 attempt on `AdaptLLM/finance-chat` (0/50 closed-book vs. 14–18%/50 open-book on the same JSONL) and the 2026-05-14 legal-curator scoring run on `Equall/Saul-7B-Instruct-v1`. The two scripts under `scripts/g3_*` that carried duplicated loaders now call into the package surface. No new modules, no new public classes — additive kwargs only.
12
+
13
+ ### Added — `fieldkit.eval.VerticalBench` open-book mode
14
+
15
+ - **`VerticalBench.from_jsonl(..., open_book=...)`** — new kwarg. When `True`, FinanceBench rows have their `evidence[*].evidence_text` prepended to the question (templated as "Context from <doc>: …\n\nQuestion: …\n\nAnswer with just the numeric value.") so the model sees the 10-K excerpt the gold answer was derived from. Default `None` auto-resolves to `True` for `financebench` and `False` for `legalbench` / `generic` — the right defaults per benchmark convention. Lifts inline `_load_finbench_open_book` helpers from `scripts/g3_preflight_bench.py` and `scripts/g3_measure_variants.py` into the package surface; both scripts now call `VerticalBench.from_jsonl(open_book=True, subset=…)` instead of carrying duplicated loaders. The 2026-05-13 V1 attempt on AdaptLLM/finance-chat scored 0/50 closed-book and 14–18%/50 open-book on the same JSONL — open-book is the load-bearing flag for FinanceBench scoring.
16
+ - **`VerticalBench.from_jsonl(..., subset=...)`** — new kwarg. FinanceBench-only convenience filter on the `question_type` column. Drops non-matching rows before the loader hits the `limit` cap, so callers can score the `metrics-generated` subset with `limit=50` and get 50 metrics-generated questions (not 50 mixed rows of which N are metrics-generated).
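The ordering described in the `subset` entry (filter first, cap second) can be sketched in a few lines. This is a minimal illustration and not the package's loader: `filter_then_limit` is a hypothetical helper, and the rows are synthetic dicts carrying only a `question_type` field.

```python
from itertools import islice
from typing import Iterable, Iterator, Optional


def filter_then_limit(
    rows: Iterable[dict],
    subset: Optional[str] = None,
    limit: Optional[int] = None,
) -> Iterator[dict]:
    """Drop rows whose question_type differs from `subset`, then cap at `limit`."""
    if subset is not None:
        rows = (r for r in rows if r.get("question_type") == subset)
    # islice(rows, None) yields every remaining row, so limit=None means "no cap".
    return islice(rows, limit)


# Synthetic rows: every third row is tagged "metrics-generated".
rows = [
    {"qid": i, "question_type": "metrics-generated" if i % 3 == 0 else "other"}
    for i in range(300)
]
picked = list(filter_then_limit(rows, subset="metrics-generated", limit=50))
```

Applying the cap before the filter would instead return only the matching rows among the first 50, which is the mixed-slice outcome the changelog entry says the kwarg avoids.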
17
+
18
+ ### Test suite
19
+
20
+ **+8 new tests** on `TestOpenBook` in `tests/test_vertical_bench.py` covering: auto-default for financebench, explicit `False` keeps closed-book, missing-evidence falls back to closed-book, legalbench / generic are no-ops, list-of-strings evidence shape, subset filter, subset × limit composition. Total: **375 passed, 3 skipped** offline (`pytest -q`). The 3 skips are the two `--spark`-gated live-integration tests + the `torch`-import skip in `test_training.py` (CPU-only venv).
21
+
22
+ ### Articles in this release
23
+
24
+ - [`becoming-a-legal-curator-on-spark`](https://ainative.business/field-notes/becoming-a-legal-curator-on-spark/) — second Orionfold quant card, swaps FinanceBench for a curated 5-task LegalBench subset. Drives the `subset` kwarg's first non-finance use (LegalBench tasks via `legalbench` format) and validates that the `open_book` default-off branch is correct for LegalBench JSONLs.
25
+
26
+ ### Verified on Spark
27
+
28
+ - **Live HF push:** `Orionfold/Saul-7B-Instruct-v1-GGUF` (5 GGUF variants + README, ~37 GB) shipped 2026-05-14 via the same `publish_quant(dry_run=False)` path the finance-chat card used a week earlier. Zero source changes in `fieldkit.publish` between the two pushes — the v0.4.0 surface generalized as designed.
29
+
30
+ ## [0.4.0] — 2026-05-14
31
+
32
+ Fourth public release. Two new top-level modules (`fieldkit.publish` + `fieldkit.quant`) for the G3 GGUF / Quantization Publisher pick (MTBM Pick #1 per `ideas/mtbm-use-cases.md` §6), the v0.4.x **vertical-curator overlay** on `fieldkit.eval` (`VerticalBench`), and post-dry-run card-rendering fixes that landed the first live HF push (`Orionfold/finance-chat-GGUF`). The two new modules together unlock most of Cluster G; this cut implements the GGUF critical path and stubs the other quant formats with named entry points pointing at the v0.5+ roadmap.
33
+
34
+ ### Added — `fieldkit.publish` (new module)
35
+
36
+ HuggingFace Hub adapter + auto model card builder from `fieldkit.lineage`. Three public surfaces:
37
+
38
+ - **`fieldkit.publish.ModelCard`** — frontmatter + body builder. Renders the canonical card every Orionfold artifact gets: YAML frontmatter (license, library_name, base_model, pipeline_tag, tags, model_creator), a title + elevator, a **Spark-tested** block (per-variant perplexity + tok/s + thermal envelope), a variants table, **How to run** (`ollama pull` + `from_pretrained` snippets), an optional **Lineage** block (rendered from a `fieldkit.lineage.LineageStore` if provided), a **Methods** backlink to `ainative.business/field-notes/<slug>/`, and a footer attributing the publication to Orionfold LLC.
39
+ - **`fieldkit.publish.ArtifactManifest`** — frozen dataclass for the `src/content/artifacts/<slug>.yaml` Phase-2 sync record (per memory `project_artifact_manifests_phase2`). `to_yaml()` emits via a hand-rolled stdlib emitter so the module has no runtime YAML dep. The source repo writes one of these per push; the Mac destination renders `/artifacts/<kind>/` catalog pages from `getCollection('artifacts')`.
40
+ - **`fieldkit.publish.HFHubAdapter`** — lazy-`huggingface_hub` wrapper. Defaults to `dry_run=True` (stages files on disk, logs the would-be calls, no network). Flip `dry_run=False` to push via `HfApi().upload_folder(...)`. Token resolution order: explicit `token=` → `HF_TOKEN` env → cached login. The dry-run path is fully testable offline.
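The token resolution order listed for `HFHubAdapter` (explicit `token=`, then the `HF_TOKEN` environment variable, then the cached login) could look roughly like the sketch below. `resolve_token` is a hypothetical name, not a function the diff shows, and returning `None` to signal "use the cached login" is an assumption about how the adapter defers to `huggingface_hub`.

```python
import os
from typing import Mapping, Optional


def resolve_token(
    explicit: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """Return the first available token: explicit argument, then HF_TOKEN."""
    env = os.environ if env is None else env
    if explicit:
        return explicit
    from_env = env.get("HF_TOKEN")
    if from_env:
        return from_env
    # Assumption: None tells the caller to fall back to the cached login.
    return None
```

Passing the environment in as a mapping keeps the function testable offline, in the same spirit as the dry-run default.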
41
+
42
+ Plus an orchestrator: **`fieldkit.publish.publish_quant(...)`** — one-line caller that ingests a `QuantReport`-shaped object (duck-typed; produced by `fieldkit.quant.quantize_gguf`), renders the card, writes the manifest, stages the variant files, and pushes (or dry-runs) the HF commit.
43
+
44
+ Branded constants: `ORIONFOLD_BRAND = "Orionfold LLC"`, `ORIONFOLD_HF_HANDLE = "Orionfold"` (was `ORIONFOLD_HF_ORG = "orionfoldllc"` until 2026-05-14, when publishing moved to the existing user-account handle — Bartowski-shape personal handle precedent). Per the 2026-05-12 HANDOFF Q3 decision: Orionfold LLC is the parent brand for all AI-artifact publishing surfaces; repo names follow the Bartowski shape (`Orionfold/<model>-GGUF`, `Orionfold/<model>-LoRA`). `ORIONFOLD_HF_ORG` is retained as a back-compat alias pointing at the new constant; will be dropped at the next major cut.
45
+
46
+ ### Added — `fieldkit.quant` (new module)
47
+
48
+ Quantization dispatcher. GGUF path implemented; AWQ/GPTQ/EXL3/MLX/NVFP4 declared as named stubs pointing at the roadmap.
49
+
50
+ - **`fieldkit.quant.quantize_gguf(...)`** — wraps `llama.cpp/convert_hf_to_gguf.py` + `llama-quantize` to emit one GGUF file per requested variant (canonical Orionfold set: `Q4_K_M`, `Q5_K_M`, `Q6_K`, `Q8_0`, `F16`). Auto-derives F16 from a HF Transformers checkpoint when the source isn't already a GGUF. `dry_run=True` enumerates the would-be subprocess commands into `report.notes` without invoking them — used by tests and CI.
51
+ - **`fieldkit.quant.measure_perplexity_gguf(...)`** — wraps `llama-perplexity`. Parses output via `parse_perplexity_output()` which recognizes the standard `Final estimate: PPL = N.NNN` shape and the lowercase `perplexity = N.NNN` fallback. Returns `None` on parse failure (cards ship without a perplexity column if measurement was skipped).
52
+ - **`fieldkit.quant.measure_tokens_per_sec_gguf(...)`** — wraps `llama-bench`. Parses output via `parse_llama_bench_output()` for `tg` (text-gen, default) or `pp` (prompt-process) tok/s.
53
+ - **`fieldkit.quant.ThermalProbe`** — pure-stdlib `nvidia-smi` poll loop. Reports sustained-load minutes before throttle, per the 2026-05-12 HANDOFF Q9 decision to publish duty-cycle limits on every Orionfold card.
54
+ - **`fieldkit.quant.LlamaCppPaths`** — locator for `llama-quantize` / `llama-perplexity` / `llama-bench` / `convert_hf_to_gguf.py`. Env defaults: `LLAMA_CPP_BIN` directory, `LLAMA_CPP_CONVERT` script path. Override any field directly.
55
+ - **`fieldkit.quant.QuantReport`** — canonical dataclass output. The contract `fieldkit.publish.publish_quant()` consumes.
56
+ - **`fieldkit.quant.quantize_awq` / `quantize_gptq` / `quantize_exl3` / `quantize_mlx` / `quantize_nvfp4`** — named entry-point stubs. Raise `NotImplementedError` with a one-liner pointing at `ideas/mtbm-use-cases.md` §7. Locks the v0.4 public surface so v0.5+ implementations slot in without an API break.
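The two output shapes the perplexity parser is described as recognizing (`Final estimate: PPL = N.NNN`, with a lowercase `perplexity = N.NNN` fallback, and `None` on failure) can be illustrated with a pair of regular expressions. This is a sketch under those stated shapes only. `parse_ppl` is a hypothetical stand-in and may differ from the package's `parse_perplexity_output()`.

```python
import re
from typing import Optional

# Preferred shape first, lowercase fallback second.
_FINAL = re.compile(r"Final estimate:\s*PPL\s*=\s*([0-9]+(?:\.[0-9]+)?)")
_FALLBACK = re.compile(r"perplexity\s*=\s*([0-9]+(?:\.[0-9]+)?)")


def parse_ppl(text: str) -> Optional[float]:
    """Return the perplexity found in `text`, or None if neither shape matches."""
    for pattern in (_FINAL, _FALLBACK):
        match = pattern.search(text)
        if match:
            return float(match.group(1))
    return None
```

Returning `None` rather than raising lets a card be rendered without a perplexity column when the measurement was skipped or the output could not be parsed.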
57
+
58
+ ### Added — `fieldkit.eval.VerticalBench` (v0.4.x — vertical-curator overlay)
59
+
60
+ Lightweight JSONL-loader wrapper around `fieldkit.eval.Bench` for vertical-domain accuracy scoring (FinanceBench / LegalBench / SemEval / generic). Drives the **vertical-curator pivot** announced 2026-05-13 (HANDOFF §2 + `ideas/mtbm-use-cases.md` §6 Pick #1.b + §8.5.1): every Orionfold quant card now ships with a vertical-domain accuracy axis, not just wikitext perplexity. Lives in `fieldkit/src/fieldkit/eval/vertical.py`; re-exported at the package root for `from fieldkit.eval import VerticalBench`.
61
+
62
+ - **`fieldkit.eval.VerticalBench`** + **`VerticalQA`** — bench shape, JSONL loader, scorer plumbing. Accepts any `Callable[[str], str]` as the model function so subprocess (`llama-cli`), in-process (`llama-cpp-python`), or NIM-backed scoring all slot in. Per-call latency aggregates alongside accuracy + refusal via the underlying `Bench`.
63
+ - **`fieldkit.eval.VerticalBench.from_jsonl(path, format='auto', ...)`** — auto-detects `financebench` / `legalbench` / `generic` JSONL shapes from the first row's field signature. Per-row metadata (company, doc_period, question_type, task) flows into per-call tags for slice-by aggregation downstream.
64
+ - **Scorers** — `exact_match`, `contains`, `numeric_match` (with configurable `rel_tolerance`, default 1% — FinanceBench convention). The bench picks `numeric_match` by default for FinanceBench-shape JSONL, `exact_match` for LegalBench-shape.
65
+
66
+ ### Added — license + How-to-run defaults on `fieldkit.publish` (v0.4.x — `Orionfold/finance-chat-GGUF` dry-run found two card bugs)
67
+
68
+ - **`ModelCard.license`** is now reachable from `publish_quant(..., model_license=...)` (and the duck-typed `quant_report.model_license` attribute). Previously the kwarg didn't exist and every card defaulted to `apache-2.0` — wrong for any Llama / Gemma / Qwen / CC-BY-NC base. AdaptLLM/finance-chat now correctly publishes with `license: llama2`.
69
+ - **`ArtifactManifest.model_license`** mirrors the same value into the Astro manifest under `license.model:`. Astro Zod schema (`src/content.config.ts`) extended with `license.model: z.string().optional()` so destination catalog pages and HF badges stay in sync. The `license.tier:` field (commercial-distribution tier — `free` / `pro`) stays distinct from this upstream-license field.
70
+ - **`ModelCard.hf_repo`** + **`ModelCard.chat_format`** + **`ModelCard.recommended_variant`** — three new fields that drive an auto-rendered default `## How to run` body. Before this fix, cards with no explicit `ollama_pull_handle` / `transformers_snippet` rendered an empty section header (the second finance-chat bug). The new renderer auto-builds three code blocks templated from `hf_repo` + a featured variant: `huggingface-cli download`, `llama-server` (OpenAI-compatible serve), and `llama-cpp-python` (in-process, threading `chat_format` if set). When all three new fields are absent + no explicit handle/snippet supplied, the section is omitted entirely (no more empty headers).
71
+ - **`publish_quant(..., model_license=, chat_format=, recommended_variant=)`** kwargs added — orchestrate all three through to card + manifest. Same duck-typed fallback through `quant_report` attributes.
72
+ - **`scripts/g3_build_first_quant.sh`** — `MODEL_LICENSE` / `CHAT_FORMAT` / `RECOMMENDED_VARIANT` env knobs added with case-statement overrides (`AdaptLLM/finance-chat → llama2 + llama-2`). Default `MODEL_LICENSE=apache-2.0` + `RECOMMENDED_VARIANT=Q5_K_M` for greenfield runs.
73
+ - **`scripts/g3_push_first_quant.py`** (new) — one-shot live-push helper that reuses the existing dry-run stage (no 32 GB re-copy via `publish_quant(dry_run=False)`); calls `HFHubAdapter.push_folder()` directly. Bakes in xet-safety env (`HF_HOME=/home/nvidia/data/.hf-cache` + `HF_HUB_DISABLE_XET=1`) per the Spark-side `~/.cache/huggingface/` permission landmine; sources `HF_TOKEN` from `.env.local` (chmod 600).
74
+ - **+11 tests** (full suite: 379 passed, 2 skipped offline). Covers: model_license override flow, default apache-2.0 fallback, default GGUF How-to-run rendering, `recommended_variant` override, `hf_repo`-less skip-section behavior, manifest `license.model` emission.
75
+
76
+ ### Added — vertical-eval surface on `fieldkit.publish`
77
+
78
+ `ModelCard` + `ArtifactManifest` + `publish_quant(...)` extended to thread per-variant vertical-eval scores through to the rendered card and the Phase-2 sync manifest:
79
+
80
+ - **`ModelCard.vertical_eval: dict[str, float]`** + **`ModelCard.vertical_eval_name: str`** — when set, the **Spark-tested** block renders a 5-column table (Variant / Size / Perplexity / tok/s / *Vertical-eval-name*) instead of the 4-column default, and the introductory copy switches from "measurement triple" to "measurement quad". Accuracy values render as percentages (`62.0%`). Cards without vertical eval render identically to v0.4.0 — backwards-compatible.
81
+ - **`ArtifactManifest.vertical_eval` + `vertical_eval_name`** — written into the YAML manifest under the same key names. Mac destination Zod schema (`src/content.config.ts`) extended to accept both. Manifests without vertical eval skip the field entirely.
82
+ - **`publish_quant(..., vertical_eval=, vertical_eval_name=)`** — explicit kwargs override whatever the duck-typed `quant_report` carries. Useful when scoring happens out-of-band from quantization (the canonical path on Spark: quantize 5 variants → measure each variant via `g3_measure_variants.py`, which calls `VerticalBench.run(llama_cli_fn)` and then feeds the resulting accuracy dict back into `publish_quant`).
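The switch between the 4-column and 5-column Spark-tested table, and the percentage rendering of accuracy, might be sketched as follows. `render_row` is a hypothetical helper, the cell formats other than the percentage are assumptions, and the numbers are placeholders chosen for illustration, not measurements.

```python
from typing import Optional


def render_row(
    variant: str,
    size_gb: float,
    perplexity: float,
    tok_s: float,
    accuracy: Optional[float] = None,
) -> str:
    """Render one Markdown table row; append an accuracy cell only when given."""
    cells = [variant, f"{size_gb:.1f} GB", f"{perplexity:.3f}", f"{tok_s:.1f}"]
    if accuracy is not None:
        # A 0.0-1.0 accuracy renders as a percentage with one decimal place.
        cells.append(f"{accuracy:.1%}")
    return "| " + " | ".join(cells) + " |"


# Placeholder values only.
with_eval = render_row("Q4_K_M", 4.0, 6.5, 40.0, accuracy=0.62)
without_eval = render_row("Q4_K_M", 4.0, 6.5, 40.0)
```

Leaving the fifth cell out entirely when no accuracy is supplied is what keeps cards without vertical eval identical to the earlier layout.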
83
+
84
+ ### Schema changes
85
+
86
+ - `src/content.config.ts` — `FIELDKIT_MODULES` extended to include `'quant'` and `'publish'` in canonical order (`capabilities, nim, rag, eval, training, lineage, quant, publish, cli`).
87
+ - `src/content.config.ts` — new `artifacts` Astro collection (Phase 2 sync contract). Loads YAML manifests from `src/content/artifacts/*.yaml`; Zod schema mirrors `fieldkit.publish.ArtifactManifest`. `ARTIFACT_KINDS` enum exposed alongside `FIELDKIT_MODULES` for downstream filtering. `src/content/artifacts/` directory created (empty + `.gitkeep`); first manifest will land when the first quant ships.
88
+ - `src/content.config.ts` — `artifacts` schema extended with optional `vertical_eval: Record<string, number>` + `vertical_eval_name: string` (vertical-curator pivot 2026-05-13).
89
+
90
+ ### Test suite
91
+
92
+ **130 new tests** across `tests/test_publish.py` (42, +16 from v0.4 scaffold incl. +11 for the model_license + How-to-run defaults fix), `tests/test_quant.py` (37), and `tests/test_vertical_bench.py` (39, new file), plus targeted regression coverage. Total: **379 passed, 2 skipped** offline (`pytest -q`). The 2 skips are `--spark`-gated live integration tests (chat NIM + pgvector); the v0.3 torch module-level skip has been resolved by lazy-importing torch only inside the training entry points. All new tests run offline — `dry_run=True` paths for `HFHubAdapter`, `publish_quant`, and `quantize_gguf` exercise the full code path without `huggingface_hub`, llama.cpp binaries, or `nvidia-smi` available. `VerticalBench` tests run without a model — `model_fn` is a callable, so a plain `lambda` exercises the full scoring + bench-aggregation path.
93
+
94
+ ### Articles in this release
95
+
96
+ - [`becoming-a-gguf-publisher-on-spark`](https://ainative.business/field-notes/becoming-a-gguf-publisher-on-spark/) — G3 v0 anchor article. 3,388 words; documents the five-variant `Orionfold/finance-chat-GGUF` release end-to-end (Spark-tested perplexity / tok/s / sustained-load minutes / FinanceBench accuracy across F16, Q8_0, Q6_K, Q5_K_M, Q4_K_M) plus the V0 preflight-bench gate and the V1 chat-vs-continued-pretrain lesson. `hf_url:` frontmatter threads the live HF receipt onto the article.
97
+
98
+ ### Verified on Spark
99
+
100
+ - **Live HF push:** `Orionfold/finance-chat-GGUF` shipped 2026-05-14 at <https://huggingface.co/Orionfold/finance-chat-GGUF> — 5 GGUF variants + auto-rendered README in 1h 57min. Repo returns HTTP 200, all 6 files present. `publish_quant(dry_run=False)` path exercised end-to-end.
101
+ - **Five-variant measurement card** (F16 / Q8_0 / Q6_K / Q5_K_M / Q4_K_M) with the four Spark-tested axes — perplexity (wikitext-2), tg + pp tok/s (`llama-bench`), sustained-load minutes (`ThermalProbe` via `nvidia-smi`), and FinanceBench accuracy (n=50, `numeric_match`, open-book) — all produced via `fieldkit.quant.measure_*` + `fieldkit.eval.VerticalBench.run(...)` on GB10.
102
+
103
+ ### Deferred to v0.5
104
+
105
+ - `fieldkit.image-lora` + `fieldkit.civitai` — Pick #2 (G9) prep. Deferred per the 2026-05-12 HANDOFF Q10 decision to sequence G3 → G9 rather than parallelize. Will land once G3 v0 proves the `fieldkit.publish` infra.
106
+ - Non-GGUF formats in `fieldkit.quant` (AWQ, GPTQ, EXL3, MLX, NVFP4). The G3 v0 niche-positioning is Nemotron-family GGUFs with the Spark-tested layer; other formats are pure surface-area expansion and can wait for an audience signal.
107
+
9
108
  ## [0.3.0] — 2026-05-11
10
109
 
11
110
  Third public release. One new top-level module (`fieldkit.lineage`) lifted from the [auto-research-loop-on-spark article](https://ainative.business/field-notes/auto-research-loop-on-spark/) — the portable part of cxcscmu's *Auto-Research-Recipes* harness, decomposed into a pure-stdlib substrate any harness on the Spark can write into.
@@ -1,6 +1,6 @@
1
1
  Metadata-Version: 2.4
2
2
  Name: fieldkit
3
- Version: 0.3.0
3
+ Version: 0.4.1
4
4
  Summary: Verified-on-Spark patterns lifted from the ai-field-notes blog into one importable Python package.
5
5
  Project-URL: Homepage, https://ainative.business/fieldkit/
6
6
  Project-URL: Source, https://github.com/manavsehgal/ai-field-notes/tree/main/fieldkit
@@ -80,6 +80,55 @@ practical_inference_envelope("70B params fp8")
80
80
 
81
81
  Raises `UnknownEnvelope` if no rule matches.
82
82
 
83
+ ### Supporting types
84
+
85
+ The `Capabilities` view is composed of three frozen dataclasses. You normally read them off `Capabilities.load()` rather than constructing them directly, but the types are re-exported for type-hinting and structural pattern-matching.
86
+
87
+ #### `Hardware`
88
+
89
+ ```python
90
+ @dataclass(frozen=True, slots=True)
91
+ class Hardware:
92
+     name: str # "DGX Spark"
93
+     unified_memory_gb: int # 128
94
+     memory_topology: str # "unified CPU+GPU"
95
+     compute_arch: str # "GB10 Grace Blackwell"
96
+     supported_dtypes: tuple[str, ...] # ("fp32", "bf16", "fp16", ...)
97
+     interconnect_to_other_gpus: str
98
+ ```
99
+
100
+ Reachable as `Capabilities.load().hardware`. Use it to gate code paths on `unified_memory_gb` or `compute_arch` without re-parsing the JSON.
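Gating a code path on `unified_memory_gb` might look like the sketch below. It uses a local stand-in dataclass that copies only two of the documented fields so the example runs without `fieldkit` installed. `weights_fit` and the 0.8 headroom factor are assumptions for illustration, not part of the package.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hardware:
    """Stand-in with a subset of the documented fields (not the real class)."""
    name: str
    unified_memory_gb: int


def weights_fit(
    hw: Hardware,
    params_billion: float,
    bytes_per_param: float,
    headroom: float = 0.8,
) -> bool:
    """True if the raw weights fit inside a fraction of unified memory."""
    # 1e9 params * bytes per param / 1e9 bytes per GB cancels to this product.
    weights_gb = params_billion * bytes_per_param
    return weights_gb <= hw.unified_memory_gb * headroom


spark = Hardware(name="DGX Spark", unified_memory_gb=128)
fp8_ok = weights_fit(spark, params_billion=70, bytes_per_param=1.0)
bf16_ok = weights_fit(spark, params_billion=70, bytes_per_param=2.0)
```

The check ignores KV cache and activation memory, so it is a coarse first filter rather than a substitute for `practical_inference_envelope()`.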
101
+
102
+ #### `MemoryBudgetRulesOfThumb`
103
+
104
+ ```python
105
+ @dataclass(frozen=True, slots=True)
106
+ class MemoryBudgetRulesOfThumb:
107
+     param_bytes: dict[str, float] # mirrors DTYPE_BYTES
108
+     training_overhead_multiplier: str
109
+     kv_cache_per_token_per_layer: str
110
+     practical_inference_envelope: dict[str, str] # {"8B params bf16": "..."}
111
+     practical_finetune_envelope: dict[str, str]
112
+ ```
113
+
114
+ Backs `practical_inference_envelope()`. Inspect `caps.memory_budget_rules_of_thumb.practical_finetune_envelope` directly when you want the fine-tune table instead of the inference one.
115
+
116
+ #### `StackEntry`
117
+
118
+ ```python
119
+ @dataclass(frozen=True, slots=True)
120
+ class StackEntry:
121
+     id: str # "nim", "nemo", "trt-llm", ...
122
+     label: str
123
+     purpose: str
124
+     verified_in_articles: tuple[str, ...] = ()
125
+     known_limits: tuple[str, ...] = ()
126
+     fits_paper_shapes: tuple[str, ...] = ()
127
+     supported_models_at_spark_scale: tuple[str, ...] = ()
128
+ ```
129
+
130
+ One entry per Spark-relevant stack component. `frontier-scout` uses `fits_paper_shapes` to decide whether a paper's training recipe matches a stack we have running notes for; the `verified_in_articles` tuple links back into ai-field-notes slugs that proved a given stack on the box.
131
+
83
132
  ### `DTYPE_BYTES`
84
133
 
85
134
  Bytes-per-parameter table:
@@ -2,7 +2,7 @@
2
2
  module: cli
3
3
  title: fieldkit (CLI)
4
4
  summary: A thin Typer wrapper over the modules. Quick checks and smoke benchmarks without writing Python.
5
- order: 6
5
+ order: 9
6
6
  ---
7
7
 
8
8
  ## What it is
@@ -17,7 +17,7 @@ Print the installed package version.
17
17
 
18
18
  ```bash
19
19
  $ fieldkit version
20
- 0.2.0
20
+ 0.4.0
21
21
  ```
22
22
 
23
23
  ### `fieldkit envelope <size>`
@@ -9,6 +9,12 @@ order: 4
9
9
 
10
10
  The eval harnesses the project keeps reinventing: a per-call latency benchmarker that emits the same JSON shape as `articles/*/evidence/benchmark.py`, an LLM-as-judge with the three rubrics from `rag-eval-ragas-and-nemo-evaluator`, a trajectory analyzer for agent-loop JSONL, and a refusal regex catalog unioned across the project's articles.
11
11
 
12
+ **v0.4.x additions** (vertical-curator surface for the G3 GGUF publisher pipeline):
13
+
14
+ - `VerticalBench` — Spark-overlay scorer for FinanceBench / LegalBench / SemEval-style JSONL test sets. Wraps `Bench`, so latency aggregates alongside accuracy and refusal. Network access lives in the caller (`llama-cli`, NIM, vLLM) — the bench itself is offline-only and unit-testable.
15
+ - `VerticalQA` — one test case (qid + question + expected + tags) lifted from a vertical-eval JSONL.
16
+ - `exact_match` / `contains` / `numeric_match` — the three built-in scorers. `numeric_match` is the FinanceBench default (first-number ±1% rel-tol); `exact_match` is the LegalBench default; `contains` is the right pick when the model answers in prose around a key fact.
17
+
12
18
  **v0.2 additions** (verifier-loop and agent-bench primitives):
13
19
 
14
20
  - `AssertionGrader` — pure file-system grader over five assertion primitives (`file_exists`, `file_not_exists`, `file_contents_contain`, `file_contents_match_regex`, `file_unchanged`). Lifted from `clawgym-on-spark`'s deterministic grader.
@@ -44,6 +50,10 @@ from fieldkit.eval import (
44
50
 
45
51
  # v0.2 — matched-base comparison
46
52
  MatchedBaseComparison, MatchedBaseComparisonResult, GroupStats,
53
+
54
+ # v0.4.x — vertical-curator surface
55
+ VerticalBench, VerticalQA,
56
+ contains, exact_match, numeric_match,
47
57
  )
48
58
  ```
49
59
 
@@ -218,6 +228,65 @@ json.dump(result.to_dict(), open("comparison.json", "w"), indent=2)
218
228
 
219
229
  `MatchedBaseComparison.stats(rows)` is exposed separately when you only need single-rollout aggregation (no comparison). Accepts a list/iterable of dicts or a JSONL path.
220
230
 
231
+ ### `VerticalBench(name, questions, scorer=exact_match, ...)` *(v0.4.x)*
232
+
233
+ Spark-overlay scorer for vertical-domain test sets — FinanceBench, LegalBench, SemEval-style JSONL — that the G3 GGUF publisher pipeline uses as its fourth measurement axis alongside perplexity, tok/s, and sustained-load minutes.
234
+
235
+ The bench is intentionally callable-shaped: it accepts a `model_fn(prompt) -> str` and times each call via the existing `Bench` harness, so latency aggregates alongside accuracy and refusal. Network access lives in the caller (llama-cli, NIM, vLLM), keeping the bench offline-only for unit tests.
236
+
237
+ ```python
238
+ from fieldkit.eval import VerticalBench, numeric_match
239
+
240
+ vb = VerticalBench.from_jsonl(
241
+     "financebench.jsonl",
242
+     scorer=numeric_match, # FinanceBench → first-number ±1%
243
+     limit=50,
244
+ )
245
+
246
+ def model_fn(prompt: str) -> str:
247
+     return llama_cli_call(gguf_path, prompt)
248
+
249
+ bench = vb.run(model_fn, extra_tags={"variant": "Q4_K_M"})
250
+ print(bench.report()) # accuracy + refusal_rate + latency
251
+ ```
252
+
253
+ `VerticalBench.from_jsonl(path, *, format="auto", limit=None, scorer=None, scorer_kwargs=None)` auto-sniffs FinanceBench / LegalBench / generic schemas from the first JSON row. Rows missing the question or expected field are silently dropped (the row-count delta vs the JSONL is the diagnostic). The default scorer is `numeric_match` for FinanceBench and `exact_match` everywhere else; pass `scorer=` to override.
254
+
255
+ `VerticalBench.run(model_fn, *, limit=None, on_error="record", extra_tags=None)` returns the underlying `Bench` so callers route through the existing `.summary()` / `.report()` / `.dump()` pipeline. Each `BenchCall` carries `accuracy` (0.0/1.0 from the scorer) and `refusal` (0.0/1.0 from `is_refusal`) metrics; per-row metadata (company, doc_period, question_type) flows through to `BenchCall.tags` for downstream slice-by aggregation.
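The slice-by aggregation that the per-call tags enable can be sketched with plain dictionaries. The record shape here (a dict with `tags` and `accuracy` keys) is a stand-in for `BenchCall`, whose real attributes are not shown in this diff, and `accuracy_by_tag` is a hypothetical helper.

```python
from collections import defaultdict
from typing import Iterable


def accuracy_by_tag(calls: Iterable[dict], key: str) -> dict:
    """Group 0.0/1.0 accuracy scores by one tag and return the mean per group."""
    buckets = defaultdict(list)
    for call in calls:
        group = call["tags"].get(key, "unknown")
        buckets[group].append(call["accuracy"])
    return {group: sum(scores) / len(scores) for group, scores in buckets.items()}


# Synthetic records, for illustration only.
calls = [
    {"tags": {"question_type": "metrics-generated"}, "accuracy": 1.0},
    {"tags": {"question_type": "metrics-generated"}, "accuracy": 0.0},
    {"tags": {"question_type": "other"}, "accuracy": 1.0},
    {"tags": {}, "accuracy": 0.0},
]
by_type = accuracy_by_tag(calls, "question_type")
```

Rows without the requested tag land in an `unknown` bucket instead of being dropped, so the per-group counts still add up to the total number of calls.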
256
+
257
+ `VerticalBench.summary()` produces a lightweight `{name, n, scorer, tag_keys}` dict without invoking the model — useful in the lineage entry recording *what* the bench will measure before the model has actually run.
258
+
259
+ ### `VerticalQA` *(v0.4.x)*
260
+
261
+ ```python
262
+ @dataclass(frozen=True, slots=True)
263
+ class VerticalQA:
264
+     qid: str # FinanceBench `financebench_id`, etc.
265
+     question: str
266
+     expected: str
267
+     tags: dict[str, Any] = field(default_factory=dict)
268
+ ```
269
+
270
+ One vertical-eval test case. The `qid` is the row's stable id so per-row scores can be cross-referenced against the source JSONL; `tags` carry per-row metadata (company, doc_period, question_type) that flow through to `Bench` for slice-by aggregation downstream.
271
+
272
+ ### Scorers — `exact_match` / `contains` / `numeric_match` *(v0.4.x)*
273
+
274
+ Pluggable `Callable[[predicted, expected], float]` returning 1.0 / 0.0. Pass any custom callable into `VerticalBench(scorer=...)`; the three built-ins cover the dominant patterns:
275
+
276
+ ```python
277
+ exact_match("yes", "Yes") # 1.0 — whitespace + case-insensitive
278
+ contains("The 2023 revenue was $4.5B.", "$4.5B") # 1.0 — substring match
279
+ numeric_match("Revenue was $4.52B", "4.5B") # 1.0 — first number, ±1% rel-tol
280
+ numeric_match("Revenue was $4.52B", "4.5B",
281
+               rel_tolerance=0.001) # 0.0 — tighter tol
282
+ ```
283
+
284
+ | Scorer | When to use it |
285
+ |---|---|
286
+ | `exact_match(p, e)` | LegalBench-style single-label classification (`yes` / `no` / `hold` / `overrule`). Whitespace- and case-insensitive. |
287
+ | `contains(p, e)` | The model is asked to answer in prose and the reference is a key fact/number/phrase that must appear somewhere in the answer. |
288
+ | `numeric_match(p, e, *, rel_tolerance=0.01)` | FinanceBench-style quantitative answers. Extracts the first number from each side (commas stripped), compares under relative tolerance. Defaults to ±1% per FinanceBench's grading convention. Returns 0.0 if either side has no parseable number — including refusals, so the refusal counter elsewhere doesn't need to gate this scorer. |
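The first-number, relative-tolerance comparison described in the table can be sketched as below. This is an illustration of the stated behavior (commas stripped, first number from each side, 0.0 when either side has no number), not the package's implementation. `numeric_match_sketch` and `first_number` are hypothetical names, and measuring the tolerance against the expected value is an assumption.

```python
import re
from typing import Optional

_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def first_number(text: str) -> Optional[float]:
    """Return the first number in `text` after stripping thousands commas."""
    match = _NUMBER.search(text.replace(",", ""))
    return float(match.group()) if match else None


def numeric_match_sketch(
    predicted: str, expected: str, rel_tolerance: float = 0.01
) -> float:
    """1.0 if the first numbers agree within rel_tolerance of expected, else 0.0."""
    p, e = first_number(predicted), first_number(expected)
    if p is None or e is None:
        return 0.0  # covers refusals and other number-free answers
    if e == 0:
        return 1.0 if p == 0 else 0.0
    return 1.0 if abs(p - e) <= rel_tolerance * abs(e) else 0.0
```

With a 1% tolerance an expected value of 4.5 accepts answers between 4.455 and 4.545, so 4.52 passes while the same answer fails at `rel_tolerance=0.001`.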
289
+
221
290
  ## Samples
222
291
 
223
292
  - [`samples/bench-rag.py`](https://github.com/manavsehgal/ai-field-notes/blob/main/fieldkit/samples/bench-rag.py) — offline `Bench` + `Judge.parse` walkthrough.
@@ -62,6 +62,29 @@ chunks = chunk_text(long_doc, max_tokens=900)
62
62
 
63
63
  Polls `/models` until 200 or timeout. Returns `True` on success, `False` on timeout. Use it as the first call in any sample script that talks to a cold NIM.
64
64
 
65
+ ### `ChatMessage`
66
+
67
+ Type alias for the OpenAI-style chat message shape `NIMClient.chat()` consumes:
68
+
69
+ ```python
70
+ ChatMessage = dict[str, Any]
71
+ # Concretely: {"role": "system" | "user" | "assistant", "content": str | list[...]}
72
+ ```
73
+
74
+ Exported so callers can type-hint their own helpers that build message arrays without importing `Any` plumbing:
75
+
76
+ ```python
77
+ from fieldkit.nim import ChatMessage, NIMClient
78
+
79
+ def build_rag_prompt(question: str, chunks: list[str]) -> list[ChatMessage]:
80
+     return [
81
+         {"role": "system", "content": "Answer from the provided context only."},
82
+         {"role": "user", "content": "\n\n".join(chunks) + "\n\nQ: " + question},
83
+     ]
84
+ ```
85
+
86
+ The alias is intentionally permissive — content may be a string, a list of multimodal parts, or any provider-specific extension. Schema validation is left to the NIM server.
87
+
65
88
  ### Context-overflow preflight
66
89
 
67
90
  `NIMClient.chat()` runs a token-estimate check on its message list and raises `NIMContextOverflowError(estimated_tokens, ceiling)` **before any network call** when the request would exceed `NIM_CONTEXT_WINDOW = 8192`. The opaque NIM 400 from `project_spark_nim_context_window` never surfaces.
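A preflight of this kind can be sketched as follows. Everything here is a stand-in: `ContextOverflowSketch` only imitates the two-argument shape of `NIMContextOverflowError`, and the four-characters-per-token estimate is an assumption, since the docs do not say how fieldkit estimates tokens.

```python
from typing import Any

CONTEXT_WINDOW = 8192  # same value as the documented NIM_CONTEXT_WINDOW


class ContextOverflowSketch(Exception):
    """Stand-in exception carrying the estimate and the ceiling."""

    def __init__(self, estimated_tokens: int, ceiling: int) -> None:
        super().__init__(f"estimated {estimated_tokens} tokens exceeds ceiling {ceiling}")
        self.estimated_tokens = estimated_tokens
        self.ceiling = ceiling


def estimate_tokens(messages: list[dict[str, Any]]) -> int:
    """Assumed heuristic: roughly 4 characters per token, string content only."""
    chars = sum(len(m["content"]) for m in messages if isinstance(m.get("content"), str))
    return chars // 4 + 1


def preflight(messages: list[dict[str, Any]], ceiling: int = CONTEXT_WINDOW) -> int:
    """Raise before any network call if the estimate exceeds the ceiling."""
    estimated = estimate_tokens(messages)
    if estimated > ceiling:
        raise ContextOverflowSketch(estimated, ceiling)
    return estimated
```

Failing locally with both numbers attached gives the caller something actionable (trim chunks, lower `max_tokens`) instead of an opaque server-side 400.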
@@ -0,0 +1,176 @@
1
+ ---
2
+ module: publish
3
+ title: fieldkit.publish
4
+ summary: HuggingFace push surface — `ModelCard` (frontmatter + body renderer), `ArtifactManifest` (Phase-2 sync record), `HFHubAdapter` (lazy huggingface_hub wrapper, dry-run by default), `publish_quant` orchestrator. Every Orionfold artifact card carries the same Spark-tested measurement quad (perplexity, tok/s, thermal envelope, optional vertical-eval) — this module is what makes that shape deterministic.
5
+ order: 8
6
+ ---
7
+
8
+ ## What it is
9
+
10
+ The publishing side of the Orionfold production line. `fieldkit.quant` produces a `QuantReport`; `fieldkit.publish` turns it into a HuggingFace repo with a deterministic model card and a per-artifact YAML manifest the source repo and destination site both read.
11
+
12
+ Three surfaces. `ModelCard` renders the canonical card shape — frontmatter (license, library_name, base_model, tags, model_creator), a `## Spark-tested` block (perplexity + tok/s + thermal envelope + optional vertical-eval table), a `## Variants` table, an auto-generated `## How to run` body (`huggingface-cli download` + `llama-server` + `llama-cpp-python` snippets templated from the HF repo path), an optional `## Lineage` block (rendered from a `fieldkit.lineage.LineageStore` if provided), a `## Methods` backlink to the anchor article, and an Orionfold LLC footer. `ArtifactManifest` is the frozen dataclass for `src/content/artifacts/<slug>.yaml` — the Phase-2 sync record per `project_artifact_manifests_phase2`; the destination renders catalog pages from `getCollection('artifacts')`. `HFHubAdapter` is a lazy wrapper around `huggingface_hub` — defaults to `dry_run=True` (stages files + logs the would-be calls; no network, no token); flip `dry_run=False` to push via `HfApi().upload_folder(...)`.
13
+
14
+ The module exists because manual card authoring at MTBM's 3–5-day cadence is the bottleneck. Every quant needs a tags list, a perplexity table, a tok/s number, a thermal envelope note, a lineage backlink — and getting any of those wrong on the customer-facing HF page is a trust hit. `fieldkit.publish` makes the card the deterministic output of the quant+lineage run, not a hand-edit, so the only knobs the operator sets are the ones that genuinely require human judgement (the upstream license, the chat format, the featured variant).
15
+
16
+ ## Public API
17
+
18
+ ```python
19
+ from fieldkit.publish import (
20
+ ARTIFACT_KINDS, ArtifactKind, ArtifactManifest,
21
+ HFHubAdapter, HFHubNotAvailable, HFAuthError,
22
+ ModelCard, PublishError, PublishResult,
23
+ publish_quant, write_artifact_manifest,
24
+ ORIONFOLD_BRAND, ORIONFOLD_HF_HANDLE, ORIONFOLD_HF_ORG,
25
+ )
26
+ ```
27
+
28
+ ### `ORIONFOLD_BRAND` + `ORIONFOLD_HF_HANDLE`
29
+
30
+ ```python
31
+ ORIONFOLD_BRAND = "Orionfold LLC"
32
+ ORIONFOLD_HF_HANDLE = "Orionfold"
33
+ ```
34
+
35
+ The brand stamped on every card footer, and the HuggingFace user handle every repo lands under (`Orionfold/<model>-GGUF`, Bartowski-shape). `ORIONFOLD_HF_ORG` is a back-compat alias for `ORIONFOLD_HF_HANDLE`: it stays importable for out-of-tree code and is slated for removal in a future cut.
36
+
37
+ ### `ARTIFACT_KINDS`
38
+
39
+ ```python
40
+ ARTIFACT_KINDS = (
41
+ "quant", "lora", "adapter", "embed",
42
+ "reranker", "dataset", "space", "bench",
43
+ )
44
+ ```
45
+
46
+ The manifest `kind` enum. Mirrored by `src/content.config.ts`'s `ARTIFACT_KINDS` so Astro Zod validation and the Python writer stay in lockstep.
47
+
48
+ ### `ModelCard(...)`
49
+
50
+ Frozen dataclass + `render() → str`. Constructed by `publish_quant` from a `QuantReport`-shaped object plus the resolved license / chat_format / recommended_variant triple. Renders to a single `README.md`-style string.
51
+
52
+ Key fields:
53
+
54
+ ```python
55
+ ModelCard(
56
+ title="finance chat GGUF",
57
+ one_liner="...",
58
+ base_model="AdaptLLM/finance-chat",
59
+ license="llama2", # ← HF frontmatter scalar; reflects upstream model's license
60
+ library_name="gguf",
61
+ pipeline_tag="text-generation",
62
+ tags=("gguf", "spark-tested", "orionfold", "base_model:AdaptLLM/finance-chat"),
63
+ quant_format="gguf",
64
+ variants=({"name": "Q4_K_M", "size": "3.8 GB", "recommended": "..."}, ...),
65
+ perplexity={"Q4_K_M": 6.221, "Q8_0": 6.137, ...},
66
+ tokens_per_sec={"Q4_K_M": 31.1, "Q8_0": 8.9, ...},
67
+ sustained_load_minutes=2.18,
68
+ vertical_eval={"Q4_K_M": 0.14, ...}, # optional 5th column
69
+ vertical_eval_name="FinanceBench (n=50, numeric_match)",
70
+ hf_repo="Orionfold/finance-chat-GGUF", # drives default `## How to run` body
71
+ chat_format="llama-2", # → llama_cpp.Llama(chat_format=...)
72
+ recommended_variant="Q5_K_M", # featured in default snippets
73
+ ollama_pull_handle=None, # opt-in override; default body wins otherwise
74
+ transformers_snippet=None,
75
+ lineage_prompt=None, # injected by publish_quant if a LineageStore is supplied
76
+ article_slug="becoming-a-gguf-publisher-on-spark",
77
+ article_title="...",
78
+ model_creator=ORIONFOLD_BRAND,
79
+ )
80
+ ```
81
+
82
+ `render()` emits sections in canonical order: YAML frontmatter → title + elevator → `## Spark-tested` (omitted if no measurements) → `## Variants` → `## How to run` (auto-rendered defaults when no explicit handle/snippet given; entirely omitted if no defaults templatable) → `## Lineage` (if `lineage_prompt` supplied) → `## Methods` link → footer.
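The "omit a section when it has nothing to say" rule is easy to get wrong (an empty `## How to run` header is exactly that failure). A minimal sketch of the idea, assuming sections are modelled as ordered `(heading, body)` pairs; `render_sections` is a hypothetical helper, not part of `ModelCard`:

```python
def render_sections(sections: list[tuple[str, str]]) -> str:
    """Join (heading, body) pairs in order, skipping any section whose body is blank."""
    parts = [
        f"## {heading}\n\n{body.strip()}"
        for heading, body in sections
        if body and body.strip()
    ]
    return "\n\n".join(parts) + "\n"


card_body = render_sections([
    ("Spark-tested", ""),                       # no measurements: omitted
    ("Variants", "| Q4_K_M | 3.8 GB |"),
    ("How to run", "   "),                      # nothing templatable: omitted
    ("Methods", "See the anchor article."),
])
```

Filtering on the body rather than on the inputs that produce it means a header can never render without content, whichever upstream field was missing.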
83
+
84
+ ### `ArtifactManifest(...)`
85
+
86
+ Frozen dataclass for `src/content/artifacts/<slug>.yaml`. Flat-by-design (primitive types + dicts of primitives) so the YAML emitter is hand-rolled stdlib.
87
+
88
+ ```python
89
+ m = ArtifactManifest(
90
+ slug="finance-chat-gguf",
91
+ kind="quant",
92
+ artifact_class="gguf", # serialized as `class:` in YAML
93
+ base_model="AdaptLLM/finance-chat",
94
+ hf_repo="Orionfold/finance-chat-GGUF",
95
+ variants=("Q4_K_M", "Q5_K_M", "Q6_K", "Q8_0", "F16"),
96
+ perplexity={"Q4_K_M": 6.221, ...},
97
+ spark_tokens_per_sec={"Q4_K_M": 31.09, ...},
98
+ sustained_load_minutes=2.18,
99
+ vertical_eval={"Q4_K_M": 0.14, ...},
100
+ vertical_eval_name="FinanceBench (n=50, numeric_match)",
101
+ lineage_run_id=None,
102
+ license_tier="free", # Orionfold commercial tier (free / pro)
103
+ license_commercial_tier=None,
104
+ model_license="llama2", # upstream model license (HF frontmatter shape)
105
+ article="articles/becoming-a-gguf-publisher-on-spark/",
106
+ civitai_id=None,
107
+ download_count=None,
108
+ published_at="2026-05-14T04:46:11Z",
109
+ )
110
+ print(m.to_yaml())
111
+ ```
112
+
113
+ The `license_tier` / `license_commercial_tier` fields live alongside `model_license` under a nested `license:` block in the YAML output. The Mac destination's Zod schema mirrors this shape.
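To make the nesting concrete, here is a minimal sketch of a hand-rolled, stdlib-only emitter for a flat frozen dataclass. `ManifestSketch` is hypothetical and carries only a few fields; the key names inside the `license:` block (`tier`, `commercial_tier`, `model`) are assumptions, not the schema fieldkit actually writes.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ManifestSketch:
    slug: str
    kind: str
    artifact_class: str  # emitted as `class:`, since `class` is a Python keyword
    model_license: str
    license_tier: str = "free"
    license_commercial_tier: Optional[str] = None

    def to_yaml(self) -> str:
        # json.dumps yields double-quoted strings and `null`, both valid YAML scalars.
        q = json.dumps
        lines = [
            f"slug: {q(self.slug)}",
            f"kind: {q(self.kind)}",
            f"class: {q(self.artifact_class)}",
            "license:",
            f"  tier: {q(self.license_tier)}",                        # assumed key name
            f"  commercial_tier: {q(self.license_commercial_tier)}",  # assumed key name
            f"  model: {q(self.model_license)}",                      # assumed key name
        ]
        return "\n".join(lines) + "\n"
```

Keeping the dataclass flat and nesting only at emit time is what lets the writer avoid a YAML dependency.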
114
+
115
+ ### `write_artifact_manifest(manifest, *, artifacts_dir)`
116
+
117
+ Writes the manifest to `<artifacts_dir>/<slug>.yaml`. Creates the directory if missing. Returns the absolute path of the written file — callers can stage it alongside the article for the next git commit.
118
+
119
+ ### `HFHubAdapter(staging_dir, *, dry_run=True, token=None, org=ORIONFOLD_HF_HANDLE)`
120
+
121
+ Thin wrapper around `huggingface_hub`. Dry-run by default: lays out the upload set on disk under `staging_dir`, logs the would-be calls. No HF imports required, no token required. Flip `dry_run=False` to push; the lazy import of `huggingface_hub` fires only then.
122
+
123
+ ```python
124
+ adapter = HFHubAdapter(staging_dir="/tmp/orionfold-stage/finance-chat", dry_run=True)
125
+ adapter.stage_text(card.render(), "README.md") # stages from a string
126
+ adapter.stage_file(gguf_path, "model-Q4_K_M.gguf") # stages by copying a file
127
+ result = adapter.push_folder(repo_name="finance-chat-GGUF")
128
+ result.dry_run # True
129
+ result.files_uploaded # ('README.md', 'model-Q4_K_M.gguf', ...)
130
+ result.logged_calls # the upload_folder kwargs that would have fired
131
+ ```
132
+
133
+ Token resolution order: explicit `token=` arg → `HF_TOKEN` env → `HUGGING_FACE_HUB_TOKEN` env → `huggingface_hub`'s cached login. If all four are absent and `dry_run=False`, `HFAuthError` is raised before any network call.
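The resolution order reads naturally as a first-non-empty scan. The sketch below is illustrative only: `resolve_token_sketch` is a hypothetical name, and the cached login is passed in as a plain argument rather than read through `huggingface_hub`, so the block stays free of third-party imports.

```python
import os
from typing import Mapping, Optional


def resolve_token_sketch(
    explicit: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
    cached: Optional[str] = None,
) -> Optional[str]:
    """Return the first non-empty token in the documented order, or None."""
    env = os.environ if env is None else env
    candidates = (
        explicit,                            # 1. explicit token= argument
        env.get("HF_TOKEN"),                 # 2. HF_TOKEN
        env.get("HUGGING_FACE_HUB_TOKEN"),   # 3. HUGGING_FACE_HUB_TOKEN
        cached,                              # 4. cached login (stand-in value)
    )
    for candidate in candidates:
        if candidate:
            return candidate
    return None
```

A caller doing a live push would treat a `None` result as the point to raise an auth error, before constructing any HTTP client.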
134
+
135
+ ### `publish_quant(*, quant_report, base_model, repo_name, staging_dir, ...) → PublishResult`
136
+
137
+ The one-line orchestrator. Reads the duck-typed `quant_report` fields (`.format`, `.variants`, `.perplexity`, `.tokens_per_sec`, `.sustained_load_minutes`, `.variant_files`, `.vertical_eval`, `.vertical_eval_name`, `.model_license`, `.chat_format`, `.recommended_variant`), builds a `ModelCard`, stages the README + variant files, writes the `ArtifactManifest` (if `artifacts_dir` supplied), and invokes `HFHubAdapter.push_folder()`. Explicit kwargs override duck-typed report attrs.
138
+
139
+ ```python
140
+ result = publish_quant(
141
+ quant_report=report,
142
+ base_model="AdaptLLM/finance-chat",
143
+ repo_name="finance-chat-GGUF",
144
+ staging_dir="/tmp/orionfold-stage/finance-chat",
145
+ artifacts_dir="/home/nvidia/ai-field-notes/src/content/artifacts",
146
+ article_slug="becoming-a-gguf-publisher-on-spark",
147
+ article_title="...",
148
+ vertical_eval={"Q4_K_M": 0.14, "Q5_K_M": 0.16, ...},
149
+ vertical_eval_name="FinanceBench (n=50, numeric_match)",
150
+ model_license="llama2", # critical — never default silently to apache-2.0
151
+ chat_format="llama-2",
152
+ recommended_variant="Q5_K_M",
153
+ lineage_store=store, # optional; injects ## Lineage block
154
+ dry_run=True, # flip to False for the actual push
155
+ )
156
+ result.hf_repo # 'Orionfold/finance-chat-GGUF'
157
+ result.card_path # Path('/tmp/orionfold-stage/.../README.md')
158
+ result.manifest_path # Path('.../src/content/artifacts/finance-chat-gguf.yaml')
159
+ result.hf_url # None in dry-run; set after live push
160
+ ```
161
+
162
+ The `model_license` / `chat_format` / `recommended_variant` kwargs landed in v0.4.x after the `Orionfold/finance-chat-GGUF` dry-run surfaced two card-rendering bugs: a hardcoded `license: apache-2.0` (wrong for the Llama-2 lineage AdaptLLM base) and an empty `## How to run` section (when no ollama handle or transformers snippet was supplied, the section header rendered with no body). Both are now caller-controlled with sane defaults.
163
+
164
+ ## Why this surface
165
+
166
+ Three things to notice. First, `HFHubAdapter` defaults to dry-run because the right workflow is dry-run → human review → live push. Library users who want a one-shot live push pass `dry_run=False` explicitly; library users who want the staging artifact for review (the common case during development) get it for free. The `hf-publisher` skill (`/home/nvidia/.claude/skills/hf-publisher/`) wraps this workflow as a triggered Claude Code surface.
167
+
168
+ Second, `publish_quant` duck-types its report rather than importing `fieldkit.quant.QuantReport` directly. This keeps the two modules decoupled (quant doesn't import publish; publish doesn't import quant) and lets non-quant callers, such as a LoRA pipeline or an embedding pipeline, supply their own report-shaped objects without subclassing.
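The duck-typing plus "explicit kwargs override report attrs" behaviour can be sketched with `getattr` and a `SimpleNamespace`. `resolve_field` is a hypothetical helper written for illustration; how `publish_quant` resolves fields internally is not shown in these docs.

```python
from types import SimpleNamespace
from typing import Any


def resolve_field(report: Any, name: str, override: Any = None, default: Any = None) -> Any:
    """Explicit kwarg wins; otherwise use the report attribute; otherwise the default."""
    if override is not None:
        return override
    return getattr(report, name, default)


# Any object with the right attribute names works; no QuantReport import is needed.
report = SimpleNamespace(
    format="gguf",
    variants=("Q4_K_M", "Q8_0"),
    model_license="llama2",
)

license_from_report = resolve_field(report, "model_license")
license_overridden = resolve_field(report, "model_license", override="apache-2.0")
chat_format = resolve_field(report, "chat_format", default="llama-2")
```

The cost of this pattern is that a misspelled attribute silently falls through to the default, so fields where a wrong default is harmful (the license, for one) are better passed explicitly.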
169
+
170
+ Third, `ArtifactManifest` is structurally distinct from `ModelCard` even though they overlap. The card is for HuggingFace; the manifest is for the destination Astro catalog. Both encode the same artifact, but the *consumers* are different and have different schemas. Keeping them separate dataclasses lets each evolve independently — and lets `write_artifact_manifest` write the manifest even when the HF push is dry-run, which is what the source repo commits look like during article-only iterations.
171
+
172
+ ## Samples
173
+
174
+ - [`scripts/g3_build_first_quant.sh`](https://github.com/manavsehgal/ai-field-notes/blob/main/scripts/g3_build_first_quant.sh) — `publish-dryrun` step assembles a `QuantReport`-shaped `SimpleNamespace` from the measurement JSON and calls `publish_quant(..., dry_run=True)`.
175
+ - [`scripts/g3_push_first_quant.py`](https://github.com/manavsehgal/ai-field-notes/blob/main/scripts/g3_push_first_quant.py) — the live-push one-shot. Reuses the existing dry-run stage; calls `HFHubAdapter(staging_dir=..., dry_run=False).push_folder()` directly so the 32 GB of GGUF bytes don't get re-staged.
176
+ - [`articles/becoming-a-gguf-publisher-on-spark/`](https://ainative.business/field-notes/becoming-a-gguf-publisher-on-spark/) — anchor article. Walks the v0.4.x publish surface end-to-end against `Orionfold/finance-chat-GGUF` and narrates the two bugs that v0.4.0 fixed before tagging.