PyPI: fieldkit version diff, 0.2.0 (tar.gz) → 0.3.0 (tar.gz)

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (42)

{fieldkit-0.2.0 → fieldkit-0.3.0}/CHANGELOG.md RENAMED

@@ -6,6 +6,40 @@ The format follows [Keep a Changelog](https://keepachangelog.com/en/1.1.0/), and
 ## [Unreleased]
+## [0.3.0] — 2026-05-11
+Third public release. One new top-level module (`fieldkit.lineage`) lifted from the [auto-research-loop-on-spark article](https://ainative.business/field-notes/auto-research-loop-on-spark/) — the portable part of cxcscmu's *Auto-Research-Recipes* harness, decomposed into a pure-stdlib substrate any harness on the Spark can write into.
+### Added — `fieldkit.lineage` (new module)
+The portable part of cxcscmu's *Auto-Research-Recipes* harness, extracted into a top-level submodule. The case for the primitive is in the released `pg_ablation_lineage_on` vs `pg_ablation_lineage_off` runs: same agent, same prompt template, same 201-trial budget on Parameter Golf — only whether the agent's session prompt includes the rendered lineage block differs. With lineage on: 16 keeps (8.0%), 38 eval-budget overruns. Without: 3 keeps (1.5%), 123 eval-budget overruns. **5.3× more keeps · 3.2× fewer wall-wastes**, with no model change, no compute change, no prompt-template change. ([extract from #auto-research-loop-on-spark])
+The new module is pure-stdlib (no torch, no numpy) — ~200 LOC of public surface, ~330 LOC including docstrings + renderer helpers.
+- **`fieldkit.lineage.FailureLabel`** — 10-class string enum (`keep`, `discard`, `crash`, `eval_budget_overrun`, `train_budget_overrun`, `size_blocked`, `preflight_crash`, `harness_abort`, `disqualified`, `baseline`). `.value` round-trips byte-identically to cxcscmu TSVs. The `is_informational` property is the cxcscmu `_QUARANTINED_STATUSES` rule as a method — returns `False` only for `harness_abort` (bookkeeping kills); every other class carries usable signal for the next agent.
+- **`fieldkit.lineage.Trial`** — frozen dataclass for one TSV row. 17 fields in canonical order. `core_metric` is the task-agnostic primary metric (so the module works for Parameter Golf, NanoChat-D12, CIFAR, and any future task in the arc); `val_bpb` is preserved alongside for direct interop with cxcscmu-shaped data. `Trial.header()` / `Trial.to_row()` / `Trial.from_row(dict)` give exact TSV round-trip — `None` floats serialize as empty strings (matches cxcscmu convention).
+- **`fieldkit.lineage.LineageStore(root, *, lower_is_better=True)`** — append-only TSV writer at `root/results.tsv` with `fcntl.flock` exclusive locking across header + row writes (concurrent specialists can write without interleaving). Read-side accessors: `all_trials()`, `latest(n)`, `best()`, `chain_to(exp_id)` (walks `parent_exp` pointers root-first, terminates on missing or self-referential parents), and `render_prompt(...)` — the deterministic Markdown emitter.
+- **`fieldkit.lineage.LineageSnapshot`** — frozen dataclass returned by `render_prompt`. Carries the rendered Markdown string plus the underlying structured data (`current_best`, `chain_to_best`, `top_k_leaderboard`, `recent_n_activity`, `last_m_with_full_hypothesis`) so callers can index in without re-parsing.
+- **`fieldkit.lineage.RecipeEdit`** — pairs a keep trial with its workdir `snapshot_path` and `parent_snapshot_path`. `diff()` computes a unified diff of every text file in the snapshot vs the parent (binary files elide with a `Binary files ... differ` marker); baseline trials with no parent return an empty diff.
+Rendered Markdown output mirrors cxcscmu's `release_artifacts/example_lineage_pg_lineage_on_arch.txt` shape: header line + `## LEADERBOARD.md` (current best + top-K kept table) + `## KNOWLEDGE.md` (current-best lineage as a nested `└─` chain + recent-activity table + last-M detailed entries). Determinism is tested — same TSV state in produces byte-identical Markdown across calls.
+### Test suite
+**29 new tests** for `fieldkit.lineage` (`tests/test_lineage.py`): `FailureLabel` value parity + `is_informational` predicate + 10-class enum surface lock; `Trial` round-trip via TSV; `LineageStore` append / latest / best / `chain_to` correctness across linear and branched topologies; `render_prompt` determinism, top-K filtering, chain rendering with `← BEST` marker; `RecipeEdit.diff()` against parent snapshots including new-file detection.
+Total fieldkit test count: **249 passed, 3 skipped** offline (`pytest -q`) — the 3 skips are 1 module-level torch importorskip in `test_training.py` and 2 `--spark`-gated live integration tests.
+### Articles in this release
+- [`auto-research-loop-on-spark`](https://ainative.business/field-notes/auto-research-loop-on-spark/) — anchor article. Walks the 17-column schema, the 10-class enum semantics, and the cxcscmu lineage ablation that proves the primitive's value.
+### Schema change — `FIELDKIT_MODULES`
+`src/content.config.ts` extended to include `'lineage'` in the `FIELDKIT_MODULES` tuple (order: `capabilities, nim, rag, eval, training, lineage, cli`). Required so articles can declare `fieldkit_modules: ['lineage']` in their frontmatter.
+[extract from #auto-research-loop-on-spark]: https://github.com/manavsehgal/ai-field-notes/tree/main/articles/auto-research-loop-on-spark
 ## [0.2.0] — 2026-05-05
 Second public release. One new module (`fieldkit.training`) plus four extensions to the v0.1 `fieldkit.eval` surface, all lifted from articles in [ai-field-notes](https://ainative.business/field-notes/) — primarily the `clawgym-on-spark` and Frontier Scout arcs. The `fieldkit.agents` and `fieldkit.inference` modules originally targeted for v0.2 are deferred to v0.3+ because their public APIs need a second article's use case to lock in (see "Deferred to v0.3+" below).
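The `LineageStore` entry in the 0.3.0 changelog hunk above describes an append-only TSV writer that holds an exclusive `fcntl.flock` across the header and row writes. Below is a minimal sketch of that locking pattern, assuming a POSIX system. The `append_row` helper and the three-column header are hypothetical illustrations; they are not fieldkit's implementation and not its 17-column schema.

```python
import csv
import fcntl
import os
import tempfile
from pathlib import Path

# Hypothetical three-column header, for illustration only.
HEADER = ["exp_id", "status", "core_metric"]


def append_row(tsv_path: Path, row: list[str]) -> None:
    """Append one row, writing the header first if the file is empty."""
    tsv_path.parent.mkdir(parents=True, exist_ok=True)
    with open(tsv_path, "a", newline="") as f:
        # Block until we hold the exclusive lock, so the emptiness check
        # and both writes act as one unit relative to other writers.
        fcntl.flock(f.fileno(), fcntl.LOCK_EX)
        try:
            writer = csv.writer(f, delimiter="\t", lineterminator="\n")
            if os.fstat(f.fileno()).st_size == 0:
                writer.writerow(HEADER)
            writer.writerow(row)
            f.flush()  # push buffered bytes to the file before unlocking
        finally:
            fcntl.flock(f.fileno(), fcntl.LOCK_UN)


root = Path(tempfile.mkdtemp())
append_row(root / "results.tsv", ["000", "baseline", "1.0800"])
append_row(root / "results.tsv", ["001", "keep", "1.0750"])
lines = (root / "results.tsv").read_text().splitlines()
```

Taking the lock before checking the file size is the point of the pattern: two writers that both see an empty file would otherwise each emit a header.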

{fieldkit-0.2.0 → fieldkit-0.3.0}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: fieldkit
-Version: 0.2.0
+Version: 0.3.0
 Summary: Verified-on-Spark patterns lifted from the ai-field-notes blog into one importable Python package.
 Project-URL: Homepage, https://ainative.business/fieldkit/
 Project-URL: Source, https://github.com/manavsehgal/ai-field-notes/tree/main/fieldkit
@@ -39,7 +39,7 @@ Description-Content-Type: text/markdown
 > Verified-on-Spark patterns lifted from the [ai-field-notes](https://ainative.business/field-notes/) blog into one importable Python package.
-Every essay in `ai-field-notes` ends with `evidence/` — a folder of working code that produced the article's numbers. After 25+ articles the same patterns kept reappearing: the same NIM client wrapper, the same chunk-embed-store dance, the same bench harness. `fieldkit` is what those `evidence/` folders look like once the boilerplate is lifted into a real package.
+Every essay in `ai-field-notes` ends with `evidence/` — a folder of working code that produced the article's numbers. After 30+ articles the same patterns kept reappearing: the same NIM client wrapper, the same chunk-embed-store dance, the same bench harness, the same verifier-loop math. `fieldkit` is what those `evidence/` folders look like once the boilerplate is lifted into a real package.
 The blog stays the long-form rationale. `fieldkit` is the `pip install`-able surface so you can reproduce — and extend — the work without re-pasting 80 lines of NIM-client setup per article.
@@ -52,7 +52,7 @@ pip install fieldkit
 For the bleeding edge between releases, install from the git tag instead:
 ```bash
-pip install "git+https://github.com/manavsehgal/ai-field-notes.git@fieldkit/v0.1.0#subdirectory=fieldkit"
+pip install "git+https://github.com/manavsehgal/ai-field-notes.git@fieldkit/v0.2.0#subdirectory=fieldkit"
 ```
 ## Quickstart
@@ -64,21 +64,32 @@ client = NIMClient(base_url="http://localhost:8000/v1", model="meta/llama-3.1-8b
 print(client.chat([{"role": "user", "content": "Hello, Spark."}]))
 ```
-## What's in v0.1.0
+## What's in v0.2.0
 | Module | Purpose | Source articles |
 |---|---|---|
 | `fieldkit.capabilities` | Typed Python facade over `spark-capabilities.json` — KV cache math, weight bytes, inference envelope. | `kv-cache-arithmetic-at-inference`, `gpu-sizing-math-for-fine-tuning` |
 | `fieldkit.nim` | OpenAI-compatible NIM client wrapper with retry, chunking, and the 8192-token context guard. | `nim-first-inference-dgx-spark` and friends |
 | `fieldkit.rag` | `Pipeline(embed_url, rerank_url, pgvector_dsn, generator)` — ingest → retrieve → rerank → fuse. | `naive-rag-on-spark` and friends |
-| `fieldkit.eval` | `Bench`, `Judge`, `Trajectory` — the recurring eval harness shapes. | every article with a `bench.py` or `benchmark.py` |
+| `fieldkit.eval` | `Bench`, `Judge`, `Trajectory` — plus v0.2's `AssertionGrader`, `PassAtK`, `AgentRun`, `MatchedBaseComparison`. | every article with a `bench.py` or `benchmark.py`, plus `clawgym-on-spark`, `autoresearchbench-on-spark`, `pass-at-k-after-the-seventh-patch` |
+| `fieldkit.training` *(new in v0.2)* | `LoraReferenceSnapshot` (sidesteps peft 0.19's offloader bug), `WeightDeltaTracker` — for any RL or SFT loop. Lazy `torch` import; pure-inference envs don't pay. | `clawgym-on-spark-grpo` |
 | `fieldkit.cli` | `fieldkit bench rag`, `fieldkit feasibility <id>`, `fieldkit envelope <size>`. | discoverability |
-Modules deferred to `v0.2`: `retriever`, `ft`, `guardrails`, `agents`. To `v0.3`: `train`, `observe`.
+### What v0.2 adds
+- **`fieldkit.training`** — new module. `LoraReferenceSnapshot` is a CPU-resident snapshot of a peft adapter's LoRA tensors plus a context manager that swaps the snapshot in for one no-grad forward pass and restores trainable weights on exit. Solves a real peft 0.19 bug: `model.load_adapter(adapter_name="reference", is_trainable=False)` crashes with `KeyError` under `device_map="auto"` whenever the GPU has anything else resident — peft's offload-detection over-triggers on Spark unified memory. `WeightDeltaTracker` is a pre/post snapshot of trainable params with L2 + max|Δ| reporting — sanity-check that any fine-tuning step actually moved weights.
+- **`fieldkit.eval.AssertionGrader`** — pure-function grader over five file-system assertion primitives (`file_exists`, `file_not_exists`, `file_contents_contain`, `file_contents_match_regex`, `file_unchanged`). Lifted from `clawgym-on-spark`'s deterministic grader; no LLM, no fuzzy matching.
+- **`fieldkit.eval.PassAtK` + `pass_at_k_estimator`** — verifier-loop with the Chen 2021 unbiased pass@k estimator (lower variance than the naive `1 - (1-p)^k` for finite n).
+- **`fieldkit.eval.AgentRun` + `TurnDetail` + `summarize_agent_runs`** — per-question agent-bench schema with overrideable field-name path tuples for non-AutoResearchBench layouts.
+- **`fieldkit.eval.MatchedBaseComparison` + `GroupStats`** — two-rollout B−A driver with per-group and per-assertion-kind delta and a markdown `.report()`. Reusable for any LoRA / adapter ablation, fine-tuned-vs-base, or system-prompt-A-vs-B comparison.
+**Deferred to v0.3+:** `fieldkit.agents` (Persona / WorkspaceSeed / SynthTask / TaskAuthor / Sandbox / RolloutDriver / Trajectory + TurnRecord — 7 symbols), `fieldkit.inference.VLLMClient`, and `replay_messages_from_trajectory`. Each needs a second consuming article before its public API locks.
 ## Hardware
-`v0.1` is **Spark-only**. Every code path is verified on a DGX Spark (GB10, 128 GB unified memory, NIM 8B + embed NIM + pgvector co-resident). Portability to other CUDA 12.x boxes lands in `v0.2+` when there's demand.
+Every code path is verified on a DGX Spark (GB10, 128 GB unified memory, NIM 8B + embed NIM + pgvector co-resident). `fieldkit.training`'s torch + safetensors imports are lazy, so the package costs nothing on inference-only boxes — install `torch` and `safetensors` yourself in the training environment when you need the training primitives. NeMo / Triton / pytorch-base containers ship them; pure-inference envs don't.
+Portability to non-Spark CUDA 12.x boxes lands when there's demand.
 ## License

fieldkit-0.3.0/README.md ADDED

@@ -0,0 +1,66 @@
+# fieldkit
+> Verified-on-Spark patterns lifted from the [ai-field-notes](https://ainative.business/field-notes/) blog into one importable Python package.
+Every essay in `ai-field-notes` ends with `evidence/` — a folder of working code that produced the article's numbers. After 30+ articles the same patterns kept reappearing: the same NIM client wrapper, the same chunk-embed-store dance, the same bench harness, the same verifier-loop math. `fieldkit` is what those `evidence/` folders look like once the boilerplate is lifted into a real package.
+The blog stays the long-form rationale. `fieldkit` is the `pip install`-able surface so you can reproduce — and extend — the work without re-pasting 80 lines of NIM-client setup per article.
+## Install
+```bash
+pip install fieldkit
+```
+For the bleeding edge between releases, install from the git tag instead:
+```bash
+pip install "git+https://github.com/manavsehgal/ai-field-notes.git@fieldkit/v0.2.0#subdirectory=fieldkit"
+```
+## Quickstart
+```python
+from fieldkit.nim import NIMClient
+client = NIMClient(base_url="http://localhost:8000/v1", model="meta/llama-3.1-8b-instruct")
+print(client.chat([{"role": "user", "content": "Hello, Spark."}]))
+```
+## What's in v0.2.0
+| Module | Purpose | Source articles |
+|---|---|---|
+| `fieldkit.capabilities` | Typed Python facade over `spark-capabilities.json` — KV cache math, weight bytes, inference envelope. | `kv-cache-arithmetic-at-inference`, `gpu-sizing-math-for-fine-tuning` |
+| `fieldkit.nim` | OpenAI-compatible NIM client wrapper with retry, chunking, and the 8192-token context guard. | `nim-first-inference-dgx-spark` and friends |
+| `fieldkit.rag` | `Pipeline(embed_url, rerank_url, pgvector_dsn, generator)` — ingest → retrieve → rerank → fuse. | `naive-rag-on-spark` and friends |
+| `fieldkit.eval` | `Bench`, `Judge`, `Trajectory` — plus v0.2's `AssertionGrader`, `PassAtK`, `AgentRun`, `MatchedBaseComparison`. | every article with a `bench.py` or `benchmark.py`, plus `clawgym-on-spark`, `autoresearchbench-on-spark`, `pass-at-k-after-the-seventh-patch` |
+| `fieldkit.training` *(new in v0.2)* | `LoraReferenceSnapshot` (sidesteps peft 0.19's offloader bug), `WeightDeltaTracker` — for any RL or SFT loop. Lazy `torch` import; pure-inference envs don't pay. | `clawgym-on-spark-grpo` |
+| `fieldkit.cli` | `fieldkit bench rag`, `fieldkit feasibility <id>`, `fieldkit envelope <size>`. | discoverability |
+### What v0.2 adds
+- **`fieldkit.training`** — new module. `LoraReferenceSnapshot` is a CPU-resident snapshot of a peft adapter's LoRA tensors plus a context manager that swaps the snapshot in for one no-grad forward pass and restores trainable weights on exit. Solves a real peft 0.19 bug: `model.load_adapter(adapter_name="reference", is_trainable=False)` crashes with `KeyError` under `device_map="auto"` whenever the GPU has anything else resident — peft's offload-detection over-triggers on Spark unified memory. `WeightDeltaTracker` is a pre/post snapshot of trainable params with L2 + max|Δ| reporting — sanity-check that any fine-tuning step actually moved weights.
+- **`fieldkit.eval.AssertionGrader`** — pure-function grader over five file-system assertion primitives (`file_exists`, `file_not_exists`, `file_contents_contain`, `file_contents_match_regex`, `file_unchanged`). Lifted from `clawgym-on-spark`'s deterministic grader; no LLM, no fuzzy matching.
+- **`fieldkit.eval.PassAtK` + `pass_at_k_estimator`** — verifier-loop with the Chen 2021 unbiased pass@k estimator (lower variance than the naive `1 - (1-p)^k` for finite n).
+- **`fieldkit.eval.AgentRun` + `TurnDetail` + `summarize_agent_runs`** — per-question agent-bench schema with overrideable field-name path tuples for non-AutoResearchBench layouts.
+- **`fieldkit.eval.MatchedBaseComparison` + `GroupStats`** — two-rollout B−A driver with per-group and per-assertion-kind delta and a markdown `.report()`. Reusable for any LoRA / adapter ablation, fine-tuned-vs-base, or system-prompt-A-vs-B comparison.
+**Deferred to v0.3+:** `fieldkit.agents` (Persona / WorkspaceSeed / SynthTask / TaskAuthor / Sandbox / RolloutDriver / Trajectory + TurnRecord — 7 symbols), `fieldkit.inference.VLLMClient`, and `replay_messages_from_trajectory`. Each needs a second consuming article before its public API locks.
+## Hardware
+Every code path is verified on a DGX Spark (GB10, 128 GB unified memory, NIM 8B + embed NIM + pgvector co-resident). `fieldkit.training`'s torch + safetensors imports are lazy, so the package costs nothing on inference-only boxes — install `torch` and `safetensors` yourself in the training environment when you need the training primitives. NeMo / Triton / pytorch-base containers ship them; pure-inference envs don't.
+Portability to non-Spark CUDA 12.x boxes lands when there's demand.
+## License
+Apache-2.0. See [`LICENSE`](https://github.com/manavsehgal/ai-field-notes/blob/main/fieldkit/LICENSE).
+## Links
+- **Blog:** https://ainative.business/field-notes/
+- **Docs:** https://ainative.business/fieldkit/
+- **Source:** https://github.com/manavsehgal/ai-field-notes/tree/main/fieldkit
+- **Changelog:** [`CHANGELOG.md`](https://github.com/manavsehgal/ai-field-notes/blob/main/fieldkit/CHANGELOG.md)
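The README above names the Chen 2021 unbiased pass@k estimator, which the eval docs give as `1 - C(n-c, k) / C(n, k)`. Below is a minimal standalone sketch of that formula using only `math.comb`. The function name `pass_at_k` and the input validation are assumptions for illustration; fieldkit's own `pass_at_k_estimator` may handle edge cases differently.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for n samples with c passes: 1 - C(n-c, k) / C(n, k)."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the result is exactly 1.0
    # whenever every size-k draw must contain a passing sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Per-task (n, c) counts: samples drawn and samples that passed.
counts = [(10, 3), (10, 0), (10, 10)]
mean_pass_at_1 = sum(pass_at_k(n, c, 1) for n, c in counts) / len(counts)
mean_pass_at_5 = sum(pass_at_k(n, c, 5) for n, c in counts) / len(counts)
```

Averaging the per-task estimates, as in the last two lines, is the usual way to report a benchmark-level pass@k.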

{fieldkit-0.2.0 → fieldkit-0.3.0}/docs/api/cli.md RENAMED

@@ -1,13 +1,13 @@
 ---
 module: cli
 title: fieldkit (CLI)
-summary: A thin Typer wrapper over the four modules. Quick checks and smoke benchmarks without writing Python.
-order: 5
+summary: A thin Typer wrapper over the modules. Quick checks and smoke benchmarks without writing Python.
+order: 6
 ---
 ## What it is
-A thin command-line entry point exposed at `fieldkit` after `pip install`. Every subcommand is a ~20-line wrapper over the existing module APIs — for real workloads, import `fieldkit.{capabilities,nim,rag,eval}` directly instead.
+A thin command-line entry point exposed at `fieldkit` after `pip install`. Every subcommand is a ~20-line wrapper over the existing module APIs — for real workloads, import `fieldkit.{capabilities,nim,rag,eval,training}` directly instead.
 ## Commands
@@ -17,7 +17,7 @@ Print the installed package version.
 ```bash
 $ fieldkit version
-0.1.0.dev0
+0.2.0
 ```
 ### `fieldkit envelope <size>`

fieldkit-0.3.0/docs/api/eval.md ADDED

@@ -0,0 +1,224 @@
+---
+module: eval
+title: fieldkit.eval
+summary: Bench, Judge, Trajectory, the project's refusal detector — plus the v0.2 verifier-loop additions (AssertionGrader, PassAtK, AgentRun, MatchedBaseComparison) for agent + RL benchmarks.
+order: 4
+---
+## What it is
+The eval harnesses the project keeps reinventing: a per-call latency benchmarker that emits the same JSON shape as `articles/*/evidence/benchmark.py`, an LLM-as-judge with the three rubrics from `rag-eval-ragas-and-nemo-evaluator`, a trajectory analyzer for agent-loop JSONL, and a refusal regex catalog unioned across the project's articles.
+**v0.2 additions** (verifier-loop and agent-bench primitives):
+- `AssertionGrader` — pure file-system grader over five assertion primitives (`file_exists`, `file_not_exists`, `file_contents_contain`, `file_contents_match_regex`, `file_unchanged`). Lifted from `clawgym-on-spark`'s deterministic grader.
+- `PassAtK` + `pass_at_k_estimator` — verifier-loop with the Chen 2021 unbiased pass@k estimator. Lifted from the `pass-at-k-after-the-seventh-patch` follow-up.
+- `AgentRun` + `TurnDetail` + `summarize_agent_runs` — per-question agent-bench schema with overrideable field-name path tuples for non-AutoResearchBench layouts. Lifted from `autoresearchbench-on-spark`.
+- `MatchedBaseComparison` + `GroupStats` + `MatchedBaseComparisonResult` — two-rollout B−A driver with per-group + per-assertion-kind delta and a markdown `.report()`. Lifted from the `clawgym-on-spark` Phase 5 SFT-vs-base eval.
+## Public API
+```python
+from fieldkit.eval import (
+    # v0.1
+    Bench, BenchCall,
+    Judge, JudgeResult, JudgeError,
+    Trajectory, TrajectoryIter,
+    RUBRIC_CORRECTNESS, RUBRIC_FAITHFULNESS, RUBRIC_RELEVANCE,
+    BUILTIN_RUBRICS,
+    REFUSAL_PATTERNS,
+    is_refusal,
+    summarize_metric,
+    # v0.2 — assertion grader
+    ASSERTION_KINDS,
+    AssertionGrader, AssertionResult, GradeResult,
+    # v0.2 — pass@k
+    PassAtK, PassAtKResult,
+    pass_at_k_estimator,
+    # v0.2 — agent runs
+    AgentRun, TurnDetail,
+    summarize_agent_runs,
+    # v0.2 — matched-base comparison
+    MatchedBaseComparison, MatchedBaseComparisonResult, GroupStats,
+)
+```
+### `Bench(name, metrics, metrics_key=None)`
+Wall-clock benchmark with numeric metric aggregation. Emits the same `{summary: {...}, calls: [...]}` JSON shape the article evidence files use.
+```python
+from fieldkit.eval import Bench
+with Bench("naive-rag",
+           metrics=["embed", "retrieve", "generate_total", "end_to_end"],
+           metrics_key="timings_ms") as b:
+    b.run(pipe.ask, questions, tag_fn=lambda q: {"kind": classify(q)})
+print(b.report())                         # markdown table
+b.dump("benchmark.json")                  # full JSON
+```
+Exceptions in the callable are caught and recorded with `success=False` so a single bad input doesn't sink the sweep. Pass `on_error="raise"` to abort on first failure.
+### `Judge(client: NIMClient, rubric=RUBRIC_CORRECTNESS, ...)`
+LLM-as-judge wrapping any `NIMClient`. Three built-in rubrics: `correctness`, `faithfulness`, `relevance`.
+```python
+from fieldkit.eval import Judge
+from fieldkit.nim import NIMClient
+with NIMClient(base_url="http://localhost:8000/v1",
+               model="meta/llama-3.1-8b-instruct") as c:
+    judge = Judge.builtin(c, "correctness")
+    result = judge.grade(
+        question="How much unified memory does the Spark have?",
+        prediction="128 GB",
+        reference="128 GB",
+    )
+    print(result.score, result.rationale)
+```
+`Judge.parse(raw)` is a static helper that does JSON-then-regex score extraction (handles `{"score": 4, ...}`, fenced ```json blocks, and `"score: 4"` prose forms). Score is `None` iff parsing failed.
+### `Trajectory(iters, baseline=None, score_field="val_bpb", lower_is_better=True)`
+Agent-loop JSONL analyzer. Knob coverage, repeat rate, mode dominance, cumulative best.
+```python
+from fieldkit.eval import Trajectory
+traj = Trajectory.from_jsonl(
+    "trajectory.jsonl",
+    score_field="val_bpb",
+    lower_is_better=True,
+)
+traj.knob_coverage()        # {knob_name: count, ...}
+traj.repeat_rate()          # 0.0 .. 1.0
+traj.mode_dominance()       # {mode: fraction, ...}
+traj.cumulative_best()      # list[float]
+```
+Permissive parser drops malformed lines silently — the agent loop emits intermediate `proposed`/`failed` records too.
+### `is_refusal(text) -> bool`
+Catches "context does not contain the answer", "I do not know", "not specified", and other refusal patterns unioned from `rag-eval-ragas-and-nemo-evaluator` and `lora-on-your-own-qa-pairs`.
+### `AssertionGrader()` *(v0.2)*
+Pure-function grader over five file-system assertion primitives — no LLM, no fuzzy matching, no scoring. The five supported kinds are listed in `ASSERTION_KINDS`; an unknown kind fails the assertion with `"unknown kind: <k>"` rather than crashing the grade.
+```python
+from pathlib import Path
+from fieldkit.eval import AssertionGrader
+grader = AssertionGrader()
+result = grader.grade(
+    task,                                 # SynthTask-shaped dict OR bare list
+    post_state_root=Path("/tmp/sandbox-N"),
+)
+print(result.passed, result.n_passed, result.n_total)
+```
+`task` accepts either a SynthTask-shaped dict (must have `verifiable_assertions`; may have `task_id` and `workspace_seed.files`, the latter auto-populates `seed_files` for `file_unchanged` checks) or a bare list of assertion dicts (each with `kind`, `path`, plus kind-specific keys like `must_contain` / `regex`). Pass `seed_files=` explicitly to enforce `file_unchanged`; without it those assertions report "skipped (no seed content)" and count as pass.
+`GradeResult` is JSON-serializable via `.to_dict()` and carries per-assertion outcomes plus the binary AND across all assertions. `AssertionResult.detail` is empty on pass; on failure it records the proximate cause (missing path, regex did not match, divergent contents, etc.) so a grade dump is debuggable without re-running the rollout.
+### `PassAtK(ks=(1,))` and `pass_at_k_estimator(n, c, k)` *(v0.2)*
+Verifier-loop primitive: pass@k from per-task n-sample grades, using the **Chen et al. (2021) unbiased estimator** `1 - C(n-c, k) / C(n, k)`. Lower variance than the naive `1 - (1-p)^k` for finite n; the naive form silently over-estimates when c is small relative to n.
+```python
+from fieldkit.eval import PassAtK
+pak = PassAtK(ks=(1, 8))
+result = pak.score(
+    problems=[{"task_id": "HumanEval/0", "test": "...", ...}, ...],
+    samples=[["sample1", "sample2", ...], ...],   # K per problem
+    grader=lambda text, problem: humaneval_run(text, problem),
+)
+print(result.pass_at)            # {1: 0.7050, 8: 0.8415}
+```
+`samples` is a sequence-of-sequences with one fixed sample count across problems; `PassAtK.score` raises if they diverge. `extras_fn(problem, samples) -> dict` is an optional hook for attaching per-problem metadata (first-sample tail, decode-token counts, etc.) onto each `per_task` row without bloating the grader interface.
+When you've already graded the rollout offline (e.g. you have a `comparison.json` from a prior bench), use `pak.from_rows(rows)` with pre-counted `(task_id, n, passed)` triples to skip re-grading.
+The standalone `pass_at_k_estimator(n, c, k)` is exported separately for callers who already have `(n, c)` rows.
+### `AgentRun` + `TurnDetail` + `summarize_agent_runs(runs)` *(v0.2)*
+Canonical schema for any third-party agent bench that emits a per-question record with a status, total wall time, and a list of turn dicts. Covers AutoResearchBench, autoresearch-agent-loop, and clawgym-on-spark rollouts out of the box; field-name path tuples on `from_record` cover the rest.
+```python
+from fieldkit.eval import AgentRun, summarize_agent_runs
+runs = AgentRun.from_jsonl(
+    "evidence/runs/llama-3.1-8b/inference_output.jsonl"
+)
+print(summarize_agent_runs(runs, label="llama-3.1-8b"))
+# Custom bench shape — override the path tuples
+custom = AgentRun.from_record(
+    raw,
+    question_id_field="task_id",
+    question_id_path=(),                   # top-level
+    inference_path=("result",),            # not inference_results[0]
+    turns_field="trace",
+)
+```
+`TurnDetail` keeps five canonical fields (`turn`, `action`, `duration_s`, `input_tokens`, `output_tokens`) and stuffs everything else from the source record into `extras` so the canonical accessors stay stable while bench-specific fields (`papers_retrieved`, `parse_errors`, `candidate_cfg`) survive round-tripping.
+Convenience accessors on `AgentRun` are pure derivations of `turns`: `tool_calls()` (action == "tool"), `tool_format_errors()` (action == "error"), `total_input_tokens()`, `total_output_tokens()`, `succeeded()` (status == "finished" AND ≥1 candidate). Override `succeeded()` for benches with different success semantics.
+`summarize_agent_runs(runs, label="...")` aggregates per-status counts plus `summarize_metric` rollups for `wall_seconds`, `turns`, `candidates`, `tool_calls`, `tool_format_errors`. Mirrors the JSON shape `articles/autoresearchbench-on-spark/scripts/analyze_run.py` writes — pass straight to `json.dumps`.
+### `MatchedBaseComparison(group_extractor=...)` *(v0.2)*
+Two-rollout B−A comparison over a held-out task set. The "filter held-out by training-set membership → run rollout twice with different `--model` → emit B − A comparison" pattern is reusable for any LoRA / adapter ablation — GRPO-vs-SFT, fine-tuned-vs-base, system-prompt-A-vs-B.
+Trajectory record schema (one dict per task):
+```json
+{
+    "task_id": "synth-<persona>-NN",
+    "final_grade": {
+        "passed": true,
+        "n_passed": 3,
+        "n_total": 3,
+        "assertions": [{"kind": "file_exists", "passed": true}, ...]
+    },
+    "stopped": "task_complete",
+    "n_turns": 5,
+    "wall_seconds": 12.3
+}
+```
+```python
+from fieldkit.eval import MatchedBaseComparison
+import json
+cmp = MatchedBaseComparison()
+result = cmp.compare(
+    baseline=base_trajectories,    # list of dicts OR path/JSONL
+    candidate=sft_trajectories,
+)
+print(result.report())             # markdown headline + per-group + per-kind
+json.dump(result.to_dict(), open("comparison.json", "w"), indent=2)
+```
+`group_extractor` defaults to a synth-persona splitter (`synth-data-science-researcher-03 → data-science-researcher`); pass any `Callable[[str], str]` for arxiv-id prefixes, Bench question categories, or other task-id schemes. Set to `None` to disable per-group breakdown.
+`GroupStats` aggregates one rollout: total + per-passed task counts, per-assertion totals, `by_group` and `by_kind` buckets, stop-reason histogram, mean turns, mean wall. `MatchedBaseComparisonResult.overall_delta` carries the headline four numbers — task and per-assertion deltas in percentage points, plus mean-turns and mean-wall deltas. `.report()` renders a markdown summary table; `.to_dict()` serializes the full comparison for `comparison.json` files.
+`MatchedBaseComparison.stats(rows)` is exposed separately when you only need single-rollout aggregation (no comparison). Accepts a list/iterable of dicts or a JSONL path.
+## Samples
+- [`samples/bench-rag.py`](https://github.com/manavsehgal/ai-field-notes/blob/main/fieldkit/samples/bench-rag.py) — offline `Bench` + `Judge.parse` walkthrough.
+- [`articles/naive-rag-on-spark/evidence/benchmark.py`](https://github.com/manavsehgal/ai-field-notes/blob/main/articles/naive-rag-on-spark/evidence/benchmark.py) — the original article's benchmark, rewritten on top of `fieldkit.eval.Bench`. Reproduces the same behavioral fingerprint: 5 of 6 refusals (incl. the canonical Google-IPO false refusal) plus the Ian Thorpe grounded answer.
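The `AssertionGrader` section of eval.md above describes five file-system assertion kinds, a failure string for unknown kinds, and a skip-as-pass rule for `file_unchanged` when no seed content is supplied. Below is a rough sketch of how such a grader could be written with the standard library alone. The `check` function is hypothetical and is not fieldkit's code; the detail strings other than `unknown kind: <k>` and `skipped (no seed content)` are assumptions.

```python
import re
import tempfile
from pathlib import Path


def check(assertion: dict, root: Path, seed_files: dict) -> tuple:
    """Return (passed, detail) for one assertion dict (hypothetical helper)."""
    kind = assertion.get("kind")
    rel = assertion.get("path", "")
    target = root / rel
    if kind == "file_exists":
        ok = target.is_file()
        return ok, "" if ok else "missing path"
    if kind == "file_not_exists":
        ok = not target.exists()
        return ok, "" if ok else "path exists"
    if kind in ("file_contents_contain", "file_contents_match_regex", "file_unchanged"):
        if kind == "file_unchanged" and rel not in seed_files:
            return True, "skipped (no seed content)"
        if not target.is_file():
            return False, "missing path"
        text = target.read_text()
        if kind == "file_contents_contain":
            ok = assertion["must_contain"] in text
            return ok, "" if ok else "substring not found"
        if kind == "file_contents_match_regex":
            ok = re.search(assertion["regex"], text) is not None
            return ok, "" if ok else "regex did not match"
        ok = text == seed_files[rel]
        return ok, "" if ok else "divergent contents"
    # Unknown kinds fail the assertion instead of raising.
    return False, f"unknown kind: {kind}"


root = Path(tempfile.mkdtemp())
(root / "report.md").write_text("# Results\naccuracy: 0.91\n")
assertions = [
    {"kind": "file_exists", "path": "report.md"},
    {"kind": "file_contents_match_regex", "path": "report.md", "regex": r"accuracy: \d\.\d+"},
    {"kind": "file_not_exists", "path": "scratch.tmp"},
    {"kind": "file_moved", "path": "report.md"},  # unknown kind on purpose
]
results = [check(a, root, seed_files={}) for a in assertions]
passed = all(ok for ok, _ in results)  # binary AND across all assertions
```

Returning a detail string with each verdict is what makes a grade dump debuggable without re-running the rollout, as the docs above note.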

fieldkit-0.3.0/docs/api/lineage.md ADDED

@@ -0,0 +1,118 @@
+---
+module: lineage
+title: fieldkit.lineage
+summary: Append-only trial log + deterministic prompt rendering — the portable part of cxcscmu's Auto-Research-Recipes harness. A 17-column TSV per trial, a 10-class status enum, and the Markdown lineage block the next specialist reads at session entry.
+order: 6
+---
+## What it is
+The release_artifacts pattern from cxcscmu's *Auto-Research-Recipes* harness, decomposed into Python. Four dataclasses, one enum, ~200 lines of pure-stdlib code — and a determinism contract: same TSV state in, same rendered Markdown out.
+The case for the primitive sits in cxcscmu's own `pg_ablation_lineage_on` vs `pg_ablation_lineage_off` runs. Same agent. Same prompt template. Same 201 trials of search budget on Parameter Golf. Same Claude Opus on each specialist. The only difference is whether the agent's session prompt includes the rendered lineage block. With lineage on: 16 keeps (8.0%), 38 eval-budget overruns (19%), best `val_bpb` 1.073142. With lineage off: 3 keeps (1.5%), **123 eval-budget overruns (61%)**, best `val_bpb` 1.077413. **5.3× more keeps · 3.2× fewer wall-wastes · 0.004 val_bpb deeper.** The intervention isn't the agent. The intervention is letting the agent see what was tried.
+`fieldkit.lineage` is the portable substrate that lets you give that intervention to your own loops — no model weights, no GPUs, no NIM containers, no Claude budget. A TSV writer with `fcntl.flock` for concurrent specialist writes, a small enum, a deterministic Markdown renderer.
+## Public API
+```python
+from fieldkit.lineage import (
+    FailureLabel,
+    Trial,
+    RecipeEdit,
+    LineageSnapshot,
+    LineageStore,
+)
+```
+### `FailureLabel`
+String-valued enum with 10 classes; `value` round-trips identically to cxcscmu's TSV `status` column.
+| value | meaning |
+|---|---|
+| `keep` | Trial ran to completion, improved the leaderboard, snapshot archived |
+| `discard` | Trial ran to completion, didn't improve — informational, the clean failure mode |
+| `crash` | Trial died mid-run (exception, OOM, NCCL error) |
+| `eval_budget_overrun` | Trained inside budget, eval phase exceeded its wall — partial signal |
+| `train_budget_overrun` | Training phase exceeded its wall |
+| `size_blocked` | Killed by an artifact-size constraint |
+| `preflight_crash` | Died before the trial proper started (infrastructure) |
+| `harness_abort` | Bookkeeping kill (the only non-informational class) |
+| `disqualified` | Vision-side: completed but failed a structural gate (CIFAR) |
+| `baseline` | The seed every run starts from |
+The `is_informational` property returns `False` only for `harness_abort` — everything else carries signal for the next agent.
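A minimal sketch of how such an enum could be written with the standard library. This is an illustrative stand-in, not the shipped source: the member names other than `BASELINE` are assumed, and only the string values and the `harness_abort` rule come from the table above.

```python
from enum import Enum


class FailureLabel(str, Enum):
    """Illustrative stand-in for the enum described above, not the shipped source."""

    KEEP = "keep"
    DISCARD = "discard"
    CRASH = "crash"
    EVAL_BUDGET_OVERRUN = "eval_budget_overrun"
    TRAIN_BUDGET_OVERRUN = "train_budget_overrun"
    SIZE_BLOCKED = "size_blocked"
    PREFLIGHT_CRASH = "preflight_crash"
    HARNESS_ABORT = "harness_abort"
    DISQUALIFIED = "disqualified"
    BASELINE = "baseline"

    @property
    def is_informational(self) -> bool:
        # Only bookkeeping kills are quarantined; every other class carries signal.
        return self is not FailureLabel.HARNESS_ABORT


# Round-trip: parse a TSV status cell, then read the same string back out.
label = FailureLabel("eval_budget_overrun")
assert label.value == "eval_budget_overrun"

# Statuses a downstream consumer would keep when filtering rows.
informational = [member.value for member in FailureLabel if member.is_informational]
```

Keeping the policy on the enum means a consumer filters with `label.is_informational` rather than carrying its own list of quarantined statuses.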
+### `Trial`
+Frozen dataclass for one TSV row. 17 fields in canonical order: `exp_id`, `timestamp`, `specialist`, `parent_exp`, `baseline_exp`, `domain`, `hypothesis`, `expected_delta`, `status`, `core_metric`, `val_bpb`, `delta_vs_best`, `train_s`, `total_s`, `job_name`, `snapshot_path`, `notes`.
+`core_metric` is the task-agnostic primary metric — for language-model runs it mirrors `val_bpb`; for vision tasks it carries top-1 error or whatever the leaderboard sorts on. The duplicated `val_bpb` column is preserved for direct interop with cxcscmu-shaped TSVs.
+```python
+Trial.header()         # canonical TSV header (17 field names in order)
+trial.to_row()         # ['000', '2026-05-11T10:00:00Z', 'baseline', ...]
+Trial.from_row(rowdict)  # parse one csv.DictReader row back to a Trial
+```
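The round-trip these three calls describe can be sketched with `dataclasses` and `csv`. `TrialSketch` below is a hypothetical stand-in that holds every column as a string; the real `Trial` may type its fields differently, and the metric value is a placeholder.

```python
import csv
import io
from dataclasses import astuple, dataclass, fields


@dataclass(frozen=True)
class TrialSketch:
    """Hypothetical stand-in: the 17 columns listed above, all held as strings."""

    exp_id: str = ""
    timestamp: str = ""
    specialist: str = ""
    parent_exp: str = ""
    baseline_exp: str = ""
    domain: str = ""
    hypothesis: str = ""
    expected_delta: str = ""
    status: str = ""
    core_metric: str = ""
    val_bpb: str = ""
    delta_vs_best: str = ""
    train_s: str = ""
    total_s: str = ""
    job_name: str = ""
    snapshot_path: str = ""
    notes: str = ""

    @classmethod
    def header(cls) -> list[str]:
        # Field declaration order doubles as the canonical column order.
        return [f.name for f in fields(cls)]

    def to_row(self) -> list[str]:
        return list(astuple(self))

    @classmethod
    def from_row(cls, row: dict[str, str]) -> "TrialSketch":
        return cls(**{name: row.get(name, "") for name in cls.header()})


trial = TrialSketch(exp_id="000", timestamp="2026-05-11T10:00:00Z",
                    specialist="baseline", status="baseline",
                    core_metric="1.0")  # placeholder metric value

# Write header + one row as TSV, then parse it back with csv.DictReader.
buf = io.StringIO()
writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
writer.writerow(TrialSketch.header())
writer.writerow(trial.to_row())

buf.seek(0)
parsed = [TrialSketch.from_row(r) for r in csv.DictReader(buf, delimiter="\t")]
```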
+### `LineageStore(root, *, lower_is_better=True)`
+Append-only TSV writer at `root/results.tsv` plus read-side accessors. All writes hold an exclusive `fcntl.flock` across the header-write + row-write sequence, so multiple specialists can write concurrently without interleaving.
+```python
+from pathlib import Path
+from fieldkit.lineage import LineageStore, Trial, FailureLabel
+store = LineageStore(Path("magent_state/blackboard"))
+store.append(Trial(exp_id="000", ..., status=FailureLabel.BASELINE, ...))
+store.all_trials()        # list[Trial] in insertion order
+store.latest(n=30)        # tuple[Trial, ...] most recent
+store.best()              # Trial | None — best informational row by core_metric
+store.chain_to("014")     # tuple[Trial, ...] root-first, walking parent_exp
+```
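The locking behavior described above (one exclusive `fcntl.flock` held across the header write and the row write) might look roughly like the following. This is a Unix-only sketch under assumptions: `append_row` and the trimmed three-column header are hypothetical, not the module's actual code.

```python
import csv
import fcntl
import os
import tempfile
from pathlib import Path

HEADER = ["exp_id", "status", "core_metric"]  # trimmed for illustration


def append_row(tsv_path: Path, row: list[str]) -> None:
    """Append one row, writing the header first if the file is still empty."""
    tsv_path.parent.mkdir(parents=True, exist_ok=True)
    with open(tsv_path, "a", newline="") as fh:
        fcntl.flock(fh.fileno(), fcntl.LOCK_EX)  # blocks until we own the lock
        try:
            writer = csv.writer(fh, delimiter="\t", lineterminator="\n")
            # Check the size only after taking the lock, so two writers
            # cannot both decide the header is missing.
            if os.fstat(fh.fileno()).st_size == 0:
                writer.writerow(HEADER)
            writer.writerow(row)
            fh.flush()  # push bytes to the file before releasing the lock
        finally:
            fcntl.flock(fh.fileno(), fcntl.LOCK_UN)


root = Path(tempfile.mkdtemp())
append_row(root / "results.tsv", ["000", "baseline", "1.0"])
append_row(root / "results.tsv", ["001", "keep", "0.9"])
lines = (root / "results.tsv").read_text().splitlines()
```

Holding the lock across both writes is what keeps a second specialist from slipping a row in between another writer's header and first row.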
+### `LineageStore.render_prompt(for_specialist, *, top_k=20, recent_n=30, last_m_full=10, session_timestamp="")`
+The deterministic Markdown renderer. Returns a `LineageSnapshot` carrying both the rendered string and the underlying structured data (so callers can index in without re-parsing). Output mirrors cxcscmu's `release_artifacts/example_lineage_pg_lineage_on_arch.txt` shape: header line, `## LEADERBOARD.md` (current best + top-K kept table), `## KNOWLEDGE.md` (current-best lineage as a nested `└─` chain + recent-activity table + last-M detailed entries).
+```python
+snap = store.render_prompt(
+    for_specialist="opt",
+    top_k=20,
+    recent_n=30,
+    last_m_full=10,
+    session_timestamp="2026-05-11T11:00:00Z",
+)
+print(snap.rendered_prompt)          # the Markdown block
+snap.current_best                    # the Trial it pointed at
+snap.chain_to_best                   # tuple[Trial, ...] root → best
+snap.top_k_leaderboard               # tuple[Trial, ...] sorted by core_metric
+```
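To make the "root → best" chain and the nested `└─` rendering concrete, here is a small sketch. `Row`, `chain_to`, and `render_chain` are hypothetical helpers written for this illustration; the exact layout of the real lineage block is not reproduced here, and the metric values are placeholders.

```python
from typing import NamedTuple, Optional


class Row(NamedTuple):  # hypothetical three-field stand-in for Trial
    exp_id: str
    parent_exp: str
    core_metric: float


def chain_to(rows: list[Row], exp_id: str) -> tuple[Row, ...]:
    """Walk parent_exp links from exp_id up to the root; return root-first."""
    by_id = {r.exp_id: r for r in rows}
    chain: list[Row] = []
    seen: set[str] = set()
    current: Optional[Row] = by_id.get(exp_id)
    while current is not None and current.exp_id not in seen:
        seen.add(current.exp_id)  # guard against accidental cycles
        chain.append(current)
        current = by_id.get(current.parent_exp)
    return tuple(reversed(chain))


def render_chain(chain: tuple[Row, ...]) -> str:
    """Render a root-first chain, indenting each descendant one level deeper."""
    lines = []
    for depth, row in enumerate(chain):
        prefix = "" if depth == 0 else "  " * (depth - 1) + "└─ "
        lines.append(f"{prefix}{row.exp_id} (core_metric={row.core_metric:.4f})")
    return "\n".join(lines)


rows = [Row("000", "", 1.08), Row("007", "000", 1.079), Row("014", "007", 1.075)]
chain = chain_to(rows, "014")
text = render_chain(chain)
print(text)
```

Because the output depends only on the rows passed in, rendering the same state twice gives the same string, which is the determinism contract in miniature.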
+### `RecipeEdit`
+Frozen dataclass pairing a keep trial with its workdir snapshot and the parent snapshot. On first call, `diff()` computes a unified diff of every text file in the snapshot against the parent; binary files emit a `Binary files ... differ` marker instead.
+```python
+edit = RecipeEdit(
+    trial=keep_trial,
+    snapshot_path=Path("snapshots/014_opt"),
+    parent_snapshot_path=Path("snapshots/000_baseline"),
+)
+print(edit.diff())   # unified diff a/train.py → b/train.py
+```
+The baseline trial returns an empty diff (no parent).
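A directory-level unified diff of this kind can be sketched with `difflib`. `snapshot_diff` is a hypothetical helper, not the module's implementation; it treats any file that fails UTF-8 decoding as binary, which is an assumption of this sketch.

```python
import difflib
import tempfile
from pathlib import Path


def read_lines(path: Path) -> list[str]:
    """Return a file's lines (with newlines), or [] if the file is absent."""
    return path.read_text(encoding="utf-8").splitlines(keepends=True) if path.is_file() else []


def snapshot_diff(parent: Path, snapshot: Path) -> str:
    """Unified diff of every file under either directory, in sorted order."""
    names = sorted(
        {p.relative_to(parent).as_posix() for p in parent.rglob("*") if p.is_file()}
        | {p.relative_to(snapshot).as_posix() for p in snapshot.rglob("*") if p.is_file()}
    )
    chunks: list[str] = []
    for name in names:
        old, new = parent / name, snapshot / name
        try:
            old_lines, new_lines = read_lines(old), read_lines(new)
        except UnicodeDecodeError:
            # Undecodable content: compare raw bytes and emit a marker only.
            old_bytes = old.read_bytes() if old.is_file() else b""
            new_bytes = new.read_bytes() if new.is_file() else b""
            if old_bytes != new_bytes:
                chunks.append(f"Binary files a/{name} and b/{name} differ\n")
            continue
        chunks.extend(
            difflib.unified_diff(old_lines, new_lines,
                                 fromfile=f"a/{name}", tofile=f"b/{name}")
        )
    return "".join(chunks)


with tempfile.TemporaryDirectory() as tmp:
    parent = Path(tmp, "000_baseline")
    snapshot = Path(tmp, "014_opt")
    parent.mkdir()
    snapshot.mkdir()
    (parent / "train.py").write_text("lr = 0.001\n", encoding="utf-8")
    (snapshot / "train.py").write_text("lr = 0.002\n", encoding="utf-8")
    (parent / "weights.bin").write_bytes(b"\xff\x00")
    (snapshot / "weights.bin").write_bytes(b"\xff\x01")
    patch = snapshot_diff(parent, snapshot)
    unchanged = snapshot_diff(parent, parent)  # identical trees give no output
print(patch)
```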
+## Why this surface
+Three things to notice about the shape. First, `FailureLabel.is_informational` is the cxcscmu `_QUARANTINED_STATUSES` rule made into a method — any downstream consumer can read it without re-implementing the policy. Second, `LineageSnapshot` is a record of *what the agent saw* (including the rendered prompt), not just a reference to the underlying TSV state. That matters for reproducibility: if you want to know why the agent at iteration 178 made the choice it did, you read the snapshot, not the TSV. Third, `LineageStore.render_prompt` is the same deterministic function cxcscmu's `harness/blackboard.py` implements (~600 lines of careful Markdown assembly); the `fieldkit.lineage` version is the published, testable, pure-stdlib port.
+The module lands at the top level of `fieldkit` because lineage is task-agnostic. Parameter Golf uses it. NanoChat-D12 uses it. CIFAR uses it — and its `disqualified` class is the evidence that this primitive isn't language-model-specific. Putting it under `fieldkit.training` would suggest LM specificity that isn't there.
+## Samples
+- [`samples/hello-lineage.py`](https://github.com/manavsehgal/ai-field-notes/blob/main/fieldkit/samples/hello-lineage.py) — five-trial worked example: baseline, two keeps, one discard, one `eval_budget_overrun`. Prints the rendered prompt.
+- [`articles/auto-research-loop-on-spark/`](https://ainative.business/field-notes/auto-research-loop-on-spark/) — anchor article. Walks the 17-column schema, the 10-class enum semantics, and the `pg_ablation_lineage_on/off` ablation that proves the primitive's value.

fieldkit-0.3.0/docs/api/training.md ADDED

@@ -0,0 +1,85 @@
+---
+module: training
+title: fieldkit.training
+summary: Fine-tuning primitives for any RL or SFT loop on the Spark — a CPU-resident LoRA reference snapshot that sidesteps peft 0.19's offloader bug, and a pre/post weight-delta tracker for sanity-checking that gradients actually moved.
+order: 5
+---
+## What it is
+Two utilities lifted from `articles/clawgym-on-spark` for any PPO / GRPO / DPO / SFT loop on the DGX Spark's unified-memory GB10:
+- **`LoraReferenceSnapshot`** — a CPU-resident snapshot of a peft adapter's LoRA tensors plus a context manager that swaps the snapshot into the live model for one no-grad forward pass and restores trainable weights on exit. **Solves a real peft 0.19 bug**: `model.load_adapter(adapter_name="reference", is_trainable=False)` crashes with a `KeyError` under `device_map="auto"` whenever the GPU has anything else resident — peft's offload-detection over-triggers on Spark unified memory. Verified with vLLM co-resident *and* with the trainer alone. The snapshot/swap dance sidesteps the offloader entirely.
+- **`WeightDeltaTracker`** — pre/post snapshot of trainable params with L2 + max|Δ| reporting. Sanity-check that any fine-tuning step actually moved weights. The first time someone debugs "why didn't my LoRA update?" they'll wish for this.
+Both classes use **lazy `torch` imports** so `import fieldkit.training` costs nothing in environments that don't run training. Construct any class and you'll get a clear `ImportError` if `torch` (or `safetensors`, for `LoraReferenceSnapshot.from_disk`) isn't installed — install them yourself in the training environment. NeMo / Triton / pytorch-base containers ship them; pure inference envs don't.
+## Public API
+```python
+from fieldkit.training import (
+    LoraReferenceSnapshot,
+    WeightDeltaTracker,
+)
+```
+### `WeightDeltaTracker(model)`
+Snapshots every parameter whose `requires_grad` is `True` at construction time and copies it to CPU. `delta()` re-reads the live model and computes aggregate L2 + max-abs-delta against the snapshot.
+```python
+from fieldkit.training import WeightDeltaTracker
+tracker = WeightDeltaTracker(model)
+# ... one or more optimizer steps ...
+l2, max_abs = tracker.delta()
+print(f"weight L2 = {l2:.6f}, max|Δ| = {max_abs:.6f}")
+```
+`delta()` returns `(0.0, 0.0)` when no trainable params were captured (for example, because the model was set to inference mode before construction). Tensors that became trainable *after* construction are ignored: the tracker only re-measures what it captured.
+`len(tracker)` returns the number of tensors held in the pre-snapshot. ~15 lines of math, lazy-torch import.
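The arithmetic behind `delta()` can be illustrated without `torch`, using plain lists in place of tensors. This sketch assumes "aggregate L2" means the square root of the summed squared differences across all captured tensors; `weight_delta` is a hypothetical name and the values are placeholders.

```python
import math


def weight_delta(
    before: dict[str, list[float]], after: dict[str, list[float]]
) -> tuple[float, float]:
    """Aggregate L2 norm and max |delta| over the tensors named in `before`."""
    sum_sq = 0.0
    max_abs = 0.0
    for name, old in before.items():
        new = after.get(name)
        if new is None:
            continue  # tensor no longer present in the live model
        for o, n in zip(old, new):
            d = n - o
            sum_sq += d * d
            max_abs = max(max_abs, abs(d))
    return math.sqrt(sum_sq), max_abs


before = {"lora_A": [0.0, 0.0], "lora_B": [1.0]}
# "late" appears only after the snapshot, so it is never measured.
after = {"lora_A": [3.0, 0.0], "lora_B": [5.0], "late": [9.0]}

l2, max_abs = weight_delta(before, after)   # deltas are 3, 0 and 4
nothing = weight_delta({}, after)           # nothing captured
```

A result of exactly `(0.0, 0.0)` after an optimizer step is the signal worth investigating: either nothing was captured or the step did not touch the captured tensors.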
+### `LoraReferenceSnapshot(model, *, snapshot=None)`
+A context manager that swaps a CPU-resident snapshot's LoRA weights into the live model for one no-grad forward pass, then restores the pre-swap trainable values on exit. Default constructor snapshots the model's *current* trainable params (online-reference flavor); pass `snapshot=` directly to reuse one snapshot dict across many model instances.
+```python
+from fieldkit.training import LoraReferenceSnapshot
+# Online — snapshot current policy at step start
+snap = LoraReferenceSnapshot(model)
+# ... one or more optimizer steps on the policy ...
+with snap:
+    ref_logits = model(input_ids).logits   # frozen-policy forward
+# trainable weights restored on exit
+```
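The swap-and-restore pattern itself is independent of `torch`. The sketch below uses a dict of lists in place of model parameters; `SwapSketch` is a hypothetical stand-in that shows the control flow only (copy at construction, in-place swap on entry, restore on exit), not the real class.

```python
class SwapSketch:
    """Hypothetical stand-in: swap reference values in, restore on exit."""

    def __init__(self, live: dict[str, list[float]]):
        self._live = live
        # Copy current values so later in-place updates cannot alter the snapshot.
        self._reference = {k: list(v) for k, v in live.items()}
        self._saved = None

    def __enter__(self):
        if self._saved is not None:
            raise RuntimeError("swap already active; nested use is not supported")
        self._saved = {k: list(v) for k, v in self._live.items()}
        for k, ref in self._reference.items():
            self._live[k][:] = ref  # in-place write, like copying into a parameter
        return self

    def __exit__(self, exc_type, exc, tb):
        for k, saved in self._saved.items():
            self._live[k][:] = saved
        self._saved = None
        return False  # never swallow exceptions raised in the with-body


params = {"lora_A": [0.1, 0.2]}
snap = SwapSketch(params)
params["lora_A"][0] = 0.9              # pretend an optimizer step happened

with snap:
    inside = list(params["lora_A"])    # reference values are visible here
restored = list(params["lora_A"])      # trained values are back after exit
```

Writing in place rather than rebinding the dict entries mirrors the constraint that the model's parameter objects must stay the same objects the optimizer already holds.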
+### `LoraReferenceSnapshot.from_disk(model, adapter_dir, *, adapter_name="default", weights_filename="adapter_model.safetensors")`
+Load LoRA weights from a peft adapter directory on disk. Performs the **safetensors-key transform** required by peft: keys in the file have the form `base_model.<…>.weight` while live parameters have the form `base_model.<…>.<adapter_name>.weight`. The snapshot indexes live names so swap/restore Just Works.
+```python
+# Fixed reference — classic GRPO with SFT-init reference policy
+snap = LoraReferenceSnapshot.from_disk(
+    model,
+    adapter_dir="adapters/sft-init",
+    adapter_name="default",
+)
+for step in range(num_steps):
+    with snap:
+        ref_logits = model(...).logits
+    # ... policy update against fixed reference ...
+```
+Names that don't match the live model's trainable params are silently skipped — the loader is tolerant of LoRA targets that vary between the saved adapter and the live one (a common occurrence when adapters load into a slightly different model build).
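The key transform and the tolerant matching can be sketched as plain string handling. `to_live_name` and `match_snapshot` are hypothetical helpers, and the example keys are made up for illustration; only the `.weight` to `.<adapter_name>.weight` rewrite and the skip-on-mismatch behavior come from the description above.

```python
def to_live_name(file_key: str, adapter_name: str = "default") -> str:
    """Insert the adapter name before a trailing '.weight' suffix."""
    suffix = ".weight"
    if not file_key.endswith(suffix):
        return file_key
    return f"{file_key[: -len(suffix)]}.{adapter_name}{suffix}"


def match_snapshot(file_keys, live_names, adapter_name="default"):
    """Map live parameter name -> file key, skipping keys with no live match."""
    live = set(live_names)
    matched = {}
    for key in file_keys:
        name = to_live_name(key, adapter_name)
        if name in live:  # unmatched keys are skipped without raising
            matched[name] = key
    return matched


# Made-up keys: the saved adapter targets one module the live model lacks.
file_keys = ["base_model.layer0.lora_A.weight", "base_model.layer9.lora_A.weight"]
live_names = ["base_model.layer0.lora_A.default.weight"]

matched = match_snapshot(file_keys, live_names)
```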
+`len(snap)` returns the number of LoRA tensors in the snapshot. Nested `with` is rejected with a `RuntimeError` — only one swap can be active at a time.
+## Why it's only two classes
+The `clawgym-on-spark` GRPO training loop (`articles/clawgym-on-spark/scripts/grpo_train.py`) leaned on these two patterns repeatedly. They're the smallest, most-grounded surface that survived the v0.2 extract review — anything broader (a full trainer wrapper, an `RLConfig`, a peft-side adapter loader) needs a second consuming article before the API locks. Look out for them in subsequent off-policy-training pieces; the v0.3 release is where larger training surfaces will land.
+## Samples
+- [`articles/clawgym-on-spark/scripts/grpo_train.py`](https://github.com/manavsehgal/ai-field-notes/blob/main/articles/clawgym-on-spark/scripts/grpo_train.py) — the original `--reference-adapter` + snapshot/swap blocks and the `--check-weight-delta` harness this module is lifted from.
