RubyGems - toy - Versions diffs - 0.8.0 → 0.9.0 - Mend

toy 0.8.0 → 0.9.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (38)

checksums.yaml +4 -4
data/CHANGELOG.md +31 -0
data/Makefile +211 -5
data/README.md +1 -1
data/lib/toy/compute.rb +9 -0
data/lib/toy/compute_cuda.rb +8 -0
data/lib/toy/compute_metal.rb +17 -0
data/lib/toy/core/cli/new.rb +8 -0
data/lib/toy/ffi/tinynn.rb +19 -0
data/lib/toy/ffi/tinynn_cuda.rb +7 -0
data/lib/toy/ffi/tinynn_metal.rb +5 -0
data/lib/toy/llm/archs/layer_spec.rb +39 -0
data/lib/toy/llm/archs/llama_arch.rb +62 -1
data/lib/toy/llm/archs/llama_arch_cuda.rb +62 -1
data/lib/toy/llm/archs/llama_arch_metal.rb +62 -1
data/lib/toy/llm/blocks/gdn_block.rb +176 -0
data/lib/toy/llm/engine/gpt2_kv_engine.rb +11 -0
data/lib/toy/llm/engine/gpt2_kv_engine_cuda.rb +11 -0
data/lib/toy/llm/engine/gpt2_kv_engine_metal.rb +11 -0
data/lib/toy/llm/engine/llama_kv_engine.rb +10 -2
data/lib/toy/llm/engine/llama_kv_engine_cuda.rb +10 -2
data/lib/toy/llm/engine/llama_kv_engine_metal.rb +10 -2
data/lib/toy/llm/engine/llama_seq_engine.rb +16 -1
data/lib/toy/llm/engine/llama_seq_engine_cuda.rb +16 -1
data/lib/toy/llm/engine/llama_seq_engine_metal.rb +16 -1
data/lib/toy/llm/primitives/depth_scale.rb +33 -0
data/lib/toy/llm/primitives/diff_attention.rb +71 -0
data/lib/toy/llm/primitives/gdn.rb +188 -0
data/lib/toy/llm/primitives/scalable_softmax.rb +37 -0
data/lib/toy/run/eval_metal.rb +12 -0
data/lib/toy/run/infer_metal.rb +19 -0
data/lib/toy/run/train_gpt2_metal.rb +7 -0
data/lib/toy/run/train_hybrid.rb +232 -0
data/lib/toy/run/train_metal.rb +10 -0
data/lib/toy/version.rb +4 -3
data/tinynn/tinynn_backend_cuda.c +22 -0
data/tinynn/tinynn_ggml.c +231 -0
metadata +9 -2

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: d1f8e6c6264601b49a0efd757444f45ba0c6ef684a32b0a6b32d39ad6d65bf08
-  data.tar.gz: 19de5005b49891d17e0f5fd8d45ae9f614f0240ab0754f914391547cbf4bcb7c
+  metadata.gz: e6344fb33638dcdc959b0e713aa08081d9082eecec81c095c55b633828d5f3e8
+  data.tar.gz: 0bfac0f0a5f6025cae9146f877b3d565e94a4ab0f8b69ecb60144ed4f9dab8e1
 SHA512:
-  metadata.gz: 16990c36afd421bea60a8e7c71016523a8cc6295bf0b9c1590e00fda3fa7dfa2895d0343c8f86fe75b1ae1d533311cf1cd8d08ab9888820f7793df06262a7f37
-  data.tar.gz: 1af72fb299d5eee1d6019833008098e21c94b8ec2f689c4bedf411964f20d766f53ba1a91da61a7d4cf3ff7d7001897133e6ec0dfa4ed91690cc1efec8a1d17b
+  metadata.gz: 4dc0eb7b7a049022bd5c86a4853fcc52ff1e803bd0d574cddd8ceb01742eff18d02d8703c8cdcff9a044c3d81c2bf4d1a60d42ef70680a944e86a908b853e91f
+  data.tar.gz: facf2141aebb4c5384eb25d1ea3c4fcdb00725184fdc0b176e23799a871f7f485a8564ebfa38bd2ba47f3832723476e568fd40e0ffa5057c6bda242dcbc09483
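
As a hedged illustration of how digests like the ones above could be checked: a `.gem` file is a tar archive containing `metadata.gz` and `data.tar.gz`, and `checksums.yaml` records digests for both. The sketch below assumes the entry bytes have already been read out of the archive. `verify_entry` and the `sample` hash are hypothetical names used for illustration only; they are not part of toy or RubyGems.

```ruby
require "digest"

# Hypothetical helper: compare the SHA256 of an archive entry's raw bytes
# against the hex digest recorded for that entry in checksums.yaml.
def verify_entry(checksums, entry_name, bytes)
  expected = checksums.fetch("SHA256").fetch(entry_name)
  Digest::SHA256.hexdigest(bytes) == expected
end

# Illustrative data only: this is the SHA256 digest of the 3-byte string "abc",
# not a digest taken from the toy gem.
sample = {
  "SHA256" => {
    "data.tar.gz" => "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad"
  }
}

puts verify_entry(sample, "data.tar.gz", "abc")
```

In practice the parsed `checksums.yaml` would take the place of `sample`, and the bytes would come from the corresponding tar entry.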

data/CHANGELOG.md CHANGED

@@ -1,5 +1,36 @@
 # Changelog
+## v0.9.0 — 2026-06-22
+**The Dragon / Gated-DeltaNet trainable hybrid arc.** toy grows a second block
+type and the seam to stack it heterogeneously with attention — built phase by
+phase, each independently gated.
+- **Trainable GDN (Path B)**: the gated delta rule expressed as an *unrolled
+  autograd composition* (`GDN.recur_unrolled`) of ops that each have a ggml
+  backward — so a Gated-DeltaNet layer trains with **no hand-written kernel
+  backward** (ggml has none for `GATED_DELTA_NET`); the fused kernel is kept for
+  inference. Gated by forward-parity (`recur_unrolled` == fused kernel to 1e-6,
+  incl. multi-head) + a differentiability proof.
+- **L1 Dragon primitives**: `gdn` (l2/decay-gate/update-gate/recur/gated-out),
+  `diff_attention`, `scalable_softmax`, `depth_scale`; 8 elementwise ggml ops +
+  `tnn_gated_delta_net`/`tnn_conv_1d` wired (CPU-only this arc).
+- **Per-layer `LayerSpec` seam**: a flat-int `seq_layer_kinds` dispatch (one arch
+  loop, monomorphic per-kind block call) — byte-exact on homogeneous Llama
+  (from-scratch / warm-start / lora unchanged).
+- **`GDNBlock`** (L2) + **`libexec/toy-train-hybrid`**: a self-contained
+  from-scratch **attention+GDN hybrid** trains (CE loss decreases). Folding it
+  into the shared `toy train` engine is deferred behind a union-pin Spinel
+  codegen block — re-apply protocol in `docs/roadmap/gdn-hybrid-engine-reintegration.md`.
+- **Fixes**: `#1449` whole-program training abort (backward `get_rows` index OOB)
+  root-caused as a latent ggml-alloc liveness bug and fixed toy-side
+  (`tnn_input_1d_i32_persistent`, a galloc-external token-id index) — *not* a
+  spinel codegen bug (matz closed it resolved); CUDA/Metal training restored by
+  mirroring that FFI decl into the CUDA/Metal siblings. New backward-friendly
+  shims `tnn_sqrt`/`tnn_div`/`tnn_repeat`.
+- **Performance**: CPU inference ~+27% tok/s and LoRA steady-state ~−24% vs the
+  v0.8.0-era baselines (heavy CUDA bench stable).
 ## v0.8.0 — 2026-06-12
 **The first published version** (RubyGems, gem name graciously transferred
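
The "unrolled" recurrence named in the v0.9.0 entry can be pictured with a small plain-Ruby sketch. This is not toy's `GDN.recur_unrolled` and does not use ggml. It assumes one common formulation of the gated delta rule, S_t = a_t * S_{t-1} + b_t * (v_t - a_t * S_{t-1} k_t) k_t^T with output o_t = S_t q_t, for a single head. The names `mat_vec`, `gdn_step`, and `gdn_unrolled` are hypothetical, and toy's or ggml's exact conventions (normalization, gate parameterization, tensor layout) may differ.

```ruby
# Multiply a matrix (array of rows) by a vector.
def mat_vec(m, x)
  m.map { |row| row.each_with_index.sum { |w, j| w * x[j] } }
end

# One token of the assumed gated delta rule:
#   decay the state by alpha, predict the value stored under key k,
#   then write back beta times the prediction error along k.
def gdn_step(state, q, k, v, alpha, beta)
  decayed = state.map { |row| row.map { |w| alpha * w } }
  predicted = mat_vec(decayed, k)
  delta = v.each_with_index.map { |vi, i| beta * (vi - predicted[i]) }
  new_state = decayed.each_with_index.map do |row, i|
    row.each_with_index.map { |w, j| w + delta[i] * k[j] }
  end
  [new_state, mat_vec(new_state, q)]
end

# Unrolled over the sequence: every step is ordinary elementwise and
# matrix arithmetic, which is the property that lets an autograd engine
# differentiate it op by op without a fused-kernel backward.
def gdn_unrolled(qs, ks, vs, alphas, betas)
  state = Array.new(vs[0].length) { Array.new(ks[0].length, 0.0) }
  outputs = []
  qs.each_index do |t|
    state, out = gdn_step(state, qs[t], ks[t], vs[t], alphas[t], betas[t])
    outputs << out
  end
  outputs
end

key = [1.0, 0.0]
outs = gdn_unrolled([key, key], [key, key], [[2.0, 3.0], [5.0, 7.0]], [1.0, 1.0], [1.0, 1.0])
p outs
```

With a unit-norm key, no decay, and a full update gate, the second write replaces the value stored under that key rather than adding to it, which is the behavior the delta rule is named for.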

data/Makefile CHANGED

@@ -59,7 +59,12 @@ endif
 # .a in tinynn/ combined with newer Spinel C codegen can produce
 # misaligned binaries that segfault at init (Tao hit this 2026-05-26
 # after pulling Spinel 2183a92 — the lib archives weren't rebuilt).
-SPINEL_DEPS := $(SPINEL_DIR)/spinel_analyze $(SPINEL_DIR)/spinel_codegen
+# Track the compiler BINARY: post the Ruby→C rewrite there is no
+# spinel_analyze/spinel_codegen at the checkout root (the Ruby backend
+# moved to legacy/, oracle-only), just the single `spinel` binary —
+# the right rebuild trigger, present on both the legacy and C layouts
+# (verified byte-exact green on the union pin; toy#101 Part 1).
+SPINEL_DEPS := $(SPINEL_BIN)
 CC          ?= cc
 CFLAGS      ?= -O2 -fPIC -Wall -Wextra
@@ -355,6 +360,16 @@ endif
 	$(SPINEL) --cc='cc -Wl,-u,_tnn_metal_force_link -framework Foundation -framework Metal -framework MetalKit' $< -o $@
 toy-eval-metal: libexec/toy-eval-metal
+# Convenience: run both functional gates on the pure CPU path (no parity arm).
+# These are the byte-exact infer/eval baselines. Until this target existed the
+# CPU eval gate only ran behind gate-cuda's TOY_GATE_CUDA=1, so a CPU-only eval
+# regression could reach main unnoticed — and did once (the decode_step
+# PolyArray OOB, #104/#105). Self-builds the runners via bin/toy.
+.PHONY: gate-cpu
+gate-cpu:
+	ruby prep/infer_gate.rb
+	ruby prep/eval_gate.rb
 # Convenience: run both functional gates with the CUDA parity arm enabled.
 .PHONY: gate-cuda
 gate-cuda:
@@ -449,8 +464,12 @@ gate-run-log:
 # turns the skip into a failure): MRI+Fiddle reproduces the recorded
 # Spinel from-scratch gate curve BIT-EXACT (train_baseline.txt) and the
 # smollm2-135m greedy decode ids byte-equal infer_baseline.txt.
+# Prereq on the shared .so so a NEW FFI symbol (e.g. the #1449
+# tnn_input_1d_i32_persistent) can't leave a STALE .so behind that
+# fails the native leg with a missing-symbol NativeCallError — make
+# rebuilds it from the .o's automatically.
 .PHONY: gate-mri
-gate-mri:
+gate-mri: tinynn/libtinynn_ggml_shared.so
 	ruby prep/mri_gate.rb
 # toy#60 item 4 — the COLD-START consumer gate: `toy new` scaffold →
@@ -486,6 +505,34 @@ gate-compute-surface-cuda: prep/smokes/smoke_compute_surface_cuda
 	  && echo "GATE PASS [compute-surface-cuda]: lib/toy/compute_cuda.rb device entry is live" \
 	  || { echo "GATE FAIL [compute-surface-cuda]"; exit 1; }
+# Projection-lens gate: train through W_proj only (token_embd frozen) and
+# assert the loss drops (the smoke's own "is learning" verdict). The CPU
+# smoke was an ungated diagnostic; this wires it into the gate surface.
+.PHONY: gate-projection-lens
+gate-projection-lens: prep/smokes/smoke_projection_lens
+	@out="$$(STEPS=20 ./prep/smokes/smoke_projection_lens 2>&1)"; \
+	echo "$$out" | tail -2; \
+	echo "$$out" | grep -q "projection-lens training is learning" \
+	  && echo "GATE PASS [projection-lens]: W_proj-only training learns (token_embd frozen)" \
+	  || { echo "GATE FAIL [projection-lens]"; exit 1; }
+# Metal twin of the projection-lens gate. The _metal smoke is an auto-
+# generated mirror (MIRROR_METAL) that previously built but was reachable
+# from no gate; this de-orphans it. macOS-only, skips green off Darwin
+# exactly like gate-metal.
+.PHONY: gate-projection-lens-metal
+gate-projection-lens-metal:
+ifneq ($(UNAME_S),Darwin)
+	@echo "gate-projection-lens-metal: Metal is macOS-only (uname -s = $(UNAME_S)) — skipping"; exit 0
+else
+	$(MAKE) prep/smokes/smoke_projection_lens_metal
+	@out="$$(STEPS=20 ./prep/smokes/smoke_projection_lens_metal 2>&1)"; \
+	echo "$$out" | tail -2; \
+	echo "$$out" | grep -q "projection-lens training is learning" \
+	  && echo "GATE PASS [projection-lens-metal]: W_proj-only training learns on Metal" \
+	  || { echo "GATE FAIL [projection-lens-metal]"; exit 1; }
+endif
 # K-quant MoE attention regression gate (the bug long misfiled as ggml#1506):
 # head_nbytes returned 0 for K-quant attention weights → per-head mmap stride
 # collapsed every head onto head 0 → degenerate repeating decode on OLMoE
@@ -564,7 +611,8 @@ libexec/toy-train: lib/toy/run/train.rb lib/toy/dev/toy_describe_flow.rb lib/toy
 		lib/toy/train/toy_gguf_writer.rb lib/toy/train/toy_drift_grad.rb lib/toy/models/transformer.rb \
 		lib/toy/llm/primitives/rms_norm.rb lib/toy/llm/primitives/rope.rb \
 		lib/toy/llm/primitives/swiglu.rb lib/toy/llm/primitives/gqa.rb \
-		lib/toy/llm/blocks/transformer_block.rb lib/toy/llm/archs/llama_arch.rb \
+		lib/toy/llm/blocks/transformer_block.rb lib/toy/llm/primitives/gdn.rb lib/toy/llm/blocks/gdn_block.rb \
+		lib/toy/llm/archs/layer_spec.rb lib/toy/llm/archs/llama_arch.rb \
 		lib/toy/ffi/tinynn.rb tinynn/libtinynn_ggml.a | libexec
 	$(SPINEL) $< -o $@
 toy-train: libexec/toy-train
@@ -579,7 +627,8 @@ libexec/toy-train-lora: lib/toy/run/train_lora.rb lib/toy/dev/toy_describe_flow.
 		lib/toy/train/toy_gguf_writer.rb lib/toy/train/toy_drift_grad.rb lib/toy/models/transformer.rb \
 		lib/toy/llm/primitives/rms_norm.rb lib/toy/llm/primitives/rope.rb \
 		lib/toy/llm/primitives/swiglu.rb lib/toy/llm/primitives/gqa.rb \
-		lib/toy/llm/blocks/transformer_block.rb lib/toy/llm/archs/llama_arch.rb \
+		lib/toy/llm/blocks/transformer_block.rb lib/toy/llm/primitives/gdn.rb lib/toy/llm/blocks/gdn_block.rb \
+		lib/toy/llm/archs/layer_spec.rb lib/toy/llm/archs/llama_arch.rb \
 		lib/toy/ffi/tinynn.rb tinynn/libtinynn_ggml.a | libexec
 	$(SPINEL) $< -o $@
 toy-train-lora: libexec/toy-train-lora
@@ -756,6 +805,127 @@ prep/smokes/smoke_projection_lens: prep/smokes/smoke_projection_lens.rb lib/toy/
 prep/smokes/smoke_compute_surface: prep/smokes/smoke_compute_surface.rb lib/toy/compute.rb lib/toy/llm/training_batch.rb lib/toy/llm/recipe_options.rb tinynn/libtinynn_ggml.a $(SPINEL_DEPS)
 	$(SPINEL) $< -o $@
+# Dragon/GDN Phase 1 (docs/roadmap/dragon-gdn-arch-2026-06-20.md): prove the
+# newly-wired tnn_gated_delta_net + tnn_conv_1d FFI ops compute through toy's
+# stack on the in-tree ggml. Forward-only shape gate (the recurrence runs and
+# emits the documented output shape).
+.PHONY: gate-gdn-forward
+gate-gdn-forward: prep/smokes/smoke_gdn_forward
+	@out="$$(./prep/smokes/smoke_gdn_forward 2>&1)"; \
+	echo "$$out" | tail -2; \
+	echo "$$out" | grep -q "GDN smoke PASS" \
+	  && echo "GATE PASS [gdn-forward]: tnn_gated_delta_net computes through the FFI" \
+	  || { echo "GATE FAIL [gdn-forward]"; exit 1; }
+prep/smokes/smoke_gdn_forward: prep/smokes/smoke_gdn_forward.rb lib/toy.rb lib/toy/ffi/tinynn.rb tinynn/libtinynn_ggml.a $(SPINEL_DEPS)
+	$(SPINEL) $< -o $@
+# Dragon/GDN Phase 2: the Toy::LLM::Primitives::GDN L1 composition (l2-norm,
+# log-decay + sigmoid gates, recurrence, gated output norm). The gate+l2+recur
+# chain is computed end-to-end; gated_out is shape-checked.
+.PHONY: gate-gdn-primitive
+gate-gdn-primitive: prep/smokes/smoke_gdn_primitive
+	@out="$$(./prep/smokes/smoke_gdn_primitive 2>&1)"; \
+	echo "$$out" | tail -2; \
+	echo "$$out" | grep -q "GDN primitive smoke PASS" \
+	  && echo "GATE PASS [gdn-primitive]: Toy::LLM::Primitives::GDN composes + computes" \
+	  || { echo "GATE FAIL [gdn-primitive]"; exit 1; }
+prep/smokes/smoke_gdn_primitive: prep/smokes/smoke_gdn_primitive.rb lib/toy.rb lib/toy/ffi/tinynn.rb lib/toy/llm/primitives/gdn.rb tinynn/libtinynn_ggml.a $(SPINEL_DEPS)
+	$(SPINEL) $< -o $@
+# Dragon/GDN Phase 2: the Dragon attention-side L1 primitives (DiffAttention,
+# ScalableSoftmax, DepthScale).
+.PHONY: gate-dragon-attn-prims
+gate-dragon-attn-prims: prep/smokes/smoke_dragon_attn_prims
+	@out="$$(./prep/smokes/smoke_dragon_attn_prims 2>&1)"; \
+	echo "$$out" | tail -2; \
+	echo "$$out" | grep -q "Dragon attn prims smoke PASS" \
+	  && echo "GATE PASS [dragon-attn-prims]: diff-attn / ssmax / depth-scale compose" \
+	  || { echo "GATE FAIL [dragon-attn-prims]"; exit 1; }
+prep/smokes/smoke_dragon_attn_prims: prep/smokes/smoke_dragon_attn_prims.rb lib/toy.rb lib/toy/ffi/tinynn.rb lib/toy/llm/primitives/diff_attention.rb lib/toy/llm/primitives/scalable_softmax.rb lib/toy/llm/primitives/depth_scale.rb tinynn/libtinynn_ggml.a $(SPINEL_DEPS)
+	$(SPINEL) $< -o $@
+# Dragon/GDN Phase 4 (Path B): numeric-parity gate — the UNROLLED,
+# autograd-differentiable recurrence (GDN.recur_unrolled) reproduces the FUSED
+# tnn_gated_delta_net token outputs within eps. This is what lets training use
+# the composition (every op has a ggml backward) while inference keeps the fused
+# kernel. See docs/roadmap/dragon-gdn-arch-2026-06-20.md (Phase 4).
+.PHONY: gate-gdn-unrolled-parity
+gate-gdn-unrolled-parity: prep/smokes/smoke_gdn_unrolled_parity
+	@out="$$(./prep/smokes/smoke_gdn_unrolled_parity 2>&1)"; \
+	echo "$$out" | tail -2; \
+	echo "$$out" | grep -q "GDN unrolled-parity smoke PASS" \
+	  && echo "GATE PASS [gdn-unrolled-parity]: recur_unrolled == fused kernel (eps)" \
+	  || { echo "GATE FAIL [gdn-unrolled-parity]"; exit 1; }
+prep/smokes/smoke_gdn_unrolled_parity: prep/smokes/smoke_gdn_unrolled_parity.rb lib/toy.rb lib/toy/ffi/tinynn.rb lib/toy/llm/primitives/gdn.rb tinynn/libtinynn_ggml.a $(SPINEL_DEPS)
+	$(SPINEL) $< -o $@
+# Dragon/GDN Phase 4 (Path B): the differentiability proof — ggml builds + runs
+# a backward graph through recur_unrolled and yields finite non-zero dL/dq,k,v
+# with NO hand-written fused-kernel backward. This is what makes GDN trainable.
+.PHONY: gate-gdn-unrolled-backward
+gate-gdn-unrolled-backward: prep/smokes/smoke_gdn_unrolled_backward
+	@out="$$(./prep/smokes/smoke_gdn_unrolled_backward 2>&1)"; \
+	echo "$$out" | tail -2; \
+	echo "$$out" | grep -q "GDN unrolled-backward smoke PASS" \
+	  && echo "GATE PASS [gdn-unrolled-backward]: recur_unrolled is differentiable" \
+	  || { echo "GATE FAIL [gdn-unrolled-backward]"; exit 1; }
+prep/smokes/smoke_gdn_unrolled_backward: prep/smokes/smoke_gdn_unrolled_backward.rb lib/toy.rb lib/toy/ffi/tinynn.rb lib/toy/llm/primitives/gdn.rb tinynn/libtinynn_ggml.a $(SPINEL_DEPS)
+	$(SPINEL) $< -o $@
+# Dragon/GDN Phase 5: multi-head parity — the per-head recur_unrolled looped over
+# H heads + concat'd matches the fused kernel's head packing (strided slicing).
+.PHONY: gate-gdn-unrolled-parity-mh
+gate-gdn-unrolled-parity-mh: prep/smokes/smoke_gdn_unrolled_parity_mh
+	@out="$$(./prep/smokes/smoke_gdn_unrolled_parity_mh 2>&1)"; \
+	echo "$$out" | tail -2; \
+	echo "$$out" | grep -q "GDN unrolled-parity-mh smoke PASS" \
+	  && echo "GATE PASS [gdn-unrolled-parity-mh]: H-head recur_unrolled == fused kernel" \
+	  || { echo "GATE FAIL [gdn-unrolled-parity-mh]"; exit 1; }
+prep/smokes/smoke_gdn_unrolled_parity_mh: prep/smokes/smoke_gdn_unrolled_parity_mh.rb lib/toy.rb lib/toy/ffi/tinynn.rb lib/toy/llm/primitives/gdn.rb tinynn/libtinynn_ggml.a $(SPINEL_DEPS)
+	$(SPINEL) $< -o $@
+# Dragon/GDN Phase 5 capstone: a SELF-CONTAINED from-scratch HYBRID runner (one
+# attention layer + one GDN layer, dispatched by the int-kind seam pattern) in
+# its OWN compilation unit — CE loss decreases. Proves a heterogeneous
+# attention+GDN stack trains from scratch. Separate unit so it can't corrupt the
+# byte-exact llama engine (landmine #16). Reintegration into `toy train` waits on
+# the union-pin Spinel codegen fix (master/spinelc).
+.PHONY: gate-gdn-hybrid
+gate-gdn-hybrid: libexec/toy-train-hybrid
+	@out="$$(./libexec/toy-train-hybrid 2>&1)"; \
+	echo "$$out" | tail -3; \
+	echo "$$out" | grep -q "HYBRID train smoke PASS" \
+	  && echo "GATE PASS [gdn-hybrid]: attention+GDN from-scratch hybrid trains" \
+	  || { echo "GATE FAIL [gdn-hybrid]"; exit 1; }
+libexec/toy-train-hybrid: lib/toy/run/train_hybrid.rb lib/toy.rb lib/toy/ffi/tinynn.rb \
+		lib/toy/llm/primitives/rms_norm.rb lib/toy/llm/primitives/gdn.rb \
+		lib/toy/llm/blocks/gdn_block.rb lib/toy/llm/archs/layer_spec.rb \
+		tinynn/libtinynn_ggml.a $(SPINEL_DEPS) | libexec
+	$(SPINEL) $< -o $@
+.PHONY: toy-train-hybrid
+toy-train-hybrid: libexec/toy-train-hybrid
+# Dragon/GDN Phase 5 (end-of-flow): a from-scratch model whose mixer is a
+# trainable GDNBlock trains — CE loss decreases. Proves the GDN layer is an
+# end-to-end trainable residual unit (no hand-written kernel backward).
+.PHONY: gate-gdn-train
+gate-gdn-train: prep/smokes/smoke_gdn_train
+	@out="$$(./prep/smokes/smoke_gdn_train 2>&1)"; \
+	echo "$$out" | tail -3; \
+	echo "$$out" | grep -q "GDN train smoke PASS" \
+	  && echo "GATE PASS [gdn-train]: from-scratch GDN-layer model trains (loss decreases)" \
+	  || { echo "GATE FAIL [gdn-train]"; exit 1; }
+prep/smokes/smoke_gdn_train: prep/smokes/smoke_gdn_train.rb lib/toy.rb lib/toy/ffi/tinynn.rb lib/toy/llm/primitives/rms_norm.rb lib/toy/llm/primitives/gdn.rb lib/toy/llm/blocks/gdn_block.rb tinynn/libtinynn_ggml.a $(SPINEL_DEPS)
+	$(SPINEL) $< -o $@
 # toy#64 item 8 — the CUDA compute entry (lib/toy/compute_cuda.rb), the
 # consumer-ish device-at-compile-time gate. Same shape as the CPU
 # compute-surface gate but requires compute_cuda + links the CUDA
@@ -858,7 +1028,12 @@ examples/example_07_vit_tiny: examples/07_vit_tiny.rb lib/toy/compute.rb lib/toy
 example_07: examples/example_07_vit_tiny
 .PHONY: example_07
-examples-curated: example_01 example_02 example_03 example_04 example_05 example_07
+examples/example_08_gdn_block: examples/08_gdn_block.rb lib/toy.rb lib/toy/ffi/tinynn.rb lib/toy/llm/primitives/rms_norm.rb lib/toy/llm/primitives/gdn.rb lib/toy/llm/blocks/gdn_block.rb tinynn/libtinynn_ggml.a $(SPINEL_DEPS)
+	$(SPINEL) $< -o $@
+example_08: examples/example_08_gdn_block
+.PHONY: example_08
+examples-curated: example_01 example_02 example_03 example_04 example_05 example_07 example_08
 .PHONY: examples-curated
 # L4 LoRA recipe gate. Drives the same LoRA fine-tune config as the
@@ -1965,6 +2140,37 @@ bench-update: tinynn/libtinynn_ggml.a
 bench-report: tinynn/libtinynn_ggml.a
 	ruby bench/check.rb --report
+# Metal perf leg (macOS only; #104 part C). Times the metal-vs-cpu infer
+# runners via N-differencing — steady-state decode ms/token plus the
+# metal-vs-cpu ratio on THIS machine. The baseline (bench/baselines_metal.csv)
+# is Mac-pinned, like the metal_gate float baseline; capture it with
+# `make bench-metal-update` on a QUIESCED machine (desktop load skews the
+# numbers badly). Skips green off macOS, exactly like gate-metal.
+.PHONY: bench-metal bench-metal-update bench-metal-report
+bench-metal:
+ifneq ($(UNAME_S),Darwin)
+	@echo "bench-metal: Metal is macOS-only (uname -s = $(UNAME_S)) — skipping"; exit 0
+else
+	$(MAKE) libexec/toy-infer-metal libexec/toy-infer
+	ruby bench/check_metal.rb
+endif
+bench-metal-update:
+ifneq ($(UNAME_S),Darwin)
+	@echo "bench-metal-update: Metal is macOS-only (uname -s = $(UNAME_S)) — skipping"; exit 0
+else
+	$(MAKE) libexec/toy-infer-metal libexec/toy-infer
+	ruby bench/check_metal.rb --update
+endif
+bench-metal-report:
+ifneq ($(UNAME_S),Darwin)
+	@echo "bench-metal-report: Metal is macOS-only (uname -s = $(UNAME_S)) — skipping"; exit 0
+else
+	$(MAKE) libexec/toy-infer-metal libexec/toy-infer
+	ruby bench/check_metal.rb --report
+endif
 # Routine comparison vs PyTorch — the "old-stable" yardstick — in the
 # single-machine single-GPU case. Runs ON gx10: toy CUDA benches run
 # native, the PyTorch reference (bench/ref_pytorch.py) runs in the
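
The gate recipes added in this Makefile share one shape: capture the smoke's output, echo its last few lines, and pass only if the smoke's own verdict string is present. A minimal Ruby sketch of that shape follows. `gate_verdict` is a hypothetical helper, not something shipped in toy, and the sample output text is invented for illustration.

```ruby
# Hypothetical helper mirroring the gate recipe shape: keep the tail of the
# smoke output for display, then pass only if the verdict marker appears.
def gate_verdict(name, output, marker, tail: 2)
  shown = output.lines.map(&:chomp).last(tail)
  passed = output.include?(marker)
  status = passed ? "GATE PASS [#{name}]" : "GATE FAIL [#{name}]"
  { shown: shown, status: status, exit_code: passed ? 0 : 1 }
end

# Invented sample output; a real gate would capture the smoke binary's stdout
# and stderr instead.
sample_output = "step 1 ok\nshape ok\nGDN smoke PASS\n"
result = gate_verdict("gdn-forward", sample_output, "GDN smoke PASS")
puts result[:shown]
puts result[:status]
```

Keying the verdict on a string printed by the smoke itself means a crash or an early exit fails the gate by default, since the marker is never emitted.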

data/README.md CHANGED

@@ -4,7 +4,7 @@
   <img src="toy_logo.png" alt="toy" width="240" />
 </p>
-**v0.8.0** · first published gem · pre-1.0, not API-stable
+**v0.9.0** · Dragon / Gated-DeltaNet trainable hybrid arc · pre-1.0, not API-stable
 &nbsp;·&nbsp; [CHANGELOG](CHANGELOG.md)
 &nbsp;·&nbsp; [docs](docs/architecture.md)
 &nbsp;·&nbsp; [framework guide](docs/framework.md)

data/lib/toy/compute.rb CHANGED

@@ -123,6 +123,15 @@ module Toy
     def self.warm_start_recipe
       Toy::LLM::Recipes::WarmStart.new
     end
+    # toy#90 — device teardown hook. CPU has no GPU-resource lifecycle to
+    # drain, so this is a deliberate no-op; it exists only so a
+    # device-agnostic experiment body can call Toy::Device.shutdown
+    # portably before exit (the Metal entry's override is the one that
+    # actually matters — see compute_metal.rb).
+    def self.shutdown
+      nil
+    end
   end
 end

data/lib/toy/compute_cuda.rb CHANGED

@@ -92,6 +92,14 @@ module Toy
     def self.warm_start_recipe
       Toy::LLM::Recipes::WarmStartCuda.new
     end
+    # toy#90 — device teardown hook. CUDA frees its GPU allocations on
+    # process exit without a residency-set assert (unlike Metal), so this
+    # is a deliberate no-op; it exists for parity so a device-agnostic
+    # experiment body can call Toy::Device.shutdown portably.
+    def self.shutdown
+      nil
+    end
   end
 end

data/lib/toy/compute_metal.rb CHANGED

@@ -85,6 +85,23 @@ module Toy
     def self.from_scratch_recipe
       Toy::LLM::Recipes::FromScratchMetal.new
     end
+    # toy#90 — device teardown hook (THE one that matters). ggml-metal
+    # keeps a process-lifetime residency-set collection on its singleton
+    # device and asserts at the C++ static-destructor device-free that the
+    # collection is empty (vendor/ggml/src/ggml-metal/ggml-metal-device.m
+    # :618). A consumer that builds experiment_metal (toy new --lib) runs
+    # the binary directly — it gets NO GGML_METAL_NO_RESIDENCY=1 (that env
+    # is injected only by toy's own CLI subprocesses), so any Metal buffer
+    # still alive at exit aborts the process (exit 134) AFTER correct
+    # compute (toy#27 runs 3-4). Spinel has no at_exit, so a device-
+    # agnostic body MUST call Toy::Device.shutdown before returning;
+    # tnn_shutdown_engines frees every live Metal session's weights_buf
+    # (removing it from the residency set), satisfying the assert.
+    # RUNTIME-UNVERIFIED on gx10 (Linux) — Mac gate proves the exit-0.
+    def self.shutdown
+      TinyNNMetal.tnn_shutdown_engines
+    end
   end
 end
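
The three `shutdown` hooks above (CPU, CUDA, Metal) follow one pattern: every device entry exposes the same method, and the experiment body calls it explicitly as its last step. The sketch below illustrates that pattern under stated assumptions. `CpuDeviceSketch`, `MetalDeviceSketch`, and `run_experiment` are hypothetical stand-ins, and the buffer list only simulates resources that must be released before exit; nothing here calls toy or ggml.

```ruby
# Hypothetical stand-ins for per-device entry modules. A real binary would
# compile in only one of them; both expose the same shutdown hook.
module CpuDeviceSketch
  def self.name
    "cpu"
  end

  # Nothing to drain on CPU, so the hook is a deliberate no-op.
  def self.shutdown
    nil
  end
end

module MetalDeviceSketch
  @live_buffers = ["weights_buf"]

  def self.name
    "metal"
  end

  def self.live_buffers
    @live_buffers
  end

  # Stand-in for freeing every live session buffer before process exit.
  def self.shutdown
    @live_buffers.clear
    nil
  end
end

# Device-agnostic body: no at_exit hook is assumed to exist, so teardown is
# an explicit final call rather than a registered callback.
def run_experiment(device)
  puts "experiment: ok (device=" + device.name + ")"
  device.shutdown
end

run_experiment(CpuDeviceSketch)
run_experiment(MetalDeviceSketch)
```

Keeping the no-op hooks on the backends that do not need teardown is what lets the same body run unchanged on every device.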

data/lib/toy/core/cli/new.rb CHANGED

@@ -284,6 +284,14 @@ module Toy
             step = step + 1
           end
           puts "experiment: ok (device=" + Toy::Device.name + ")"
+          # toy#90 — release backend resources before exit. REQUIRED on
+          # Metal: ggml-metal asserts at device-free that its residency
+          # set is empty, and a directly-run experiment_metal gets no
+          # GGML_METAL_NO_RESIDENCY=1 (that env is injected only by toy's
+          # own CLI). No-op on cpu/cuda. Spinel has no at_exit, so this
+          # explicit call is the teardown seam.
+          Toy::Device.shutdown
         RUBY
         # Per-device entry shims — device chosen at COMPILE time by

data/lib/toy/ffi/tinynn.rb CHANGED

@@ -392,6 +392,24 @@ module TinyNN
   # C-SSM (#114): state-space model primitives.
   ffi_func :tnn_ssm_conv,         [:ptr, :ptr, :ptr],       :ptr
   ffi_func :tnn_ssm_scan,         [:ptr, :ptr, :ptr, :ptr, :ptr, :ptr, :ptr, :ptr], :ptr
+  # Gated DeltaNet recurrence core (Dragon/Qwen3-Next; GDN Phase 1). Forward-only
+  # in ggml — see docs/roadmap/dragon-gdn-arch-2026-06-20.md.
+  ffi_func :tnn_gated_delta_net,  [:ptr, :ptr, :ptr, :ptr, :ptr, :ptr, :ptr], :ptr
+  ffi_func :tnn_conv_1d,          [:ptr, :ptr, :ptr, :int, :int, :int],     :ptr
+  # Elementwise ops for GDN gate math / differential attention / gated output
+  # norm (GDN Phase 2). sigmoid(beta), exp/log (log-decay + softplus), sub
+  # (A1-λA2), neg, l2_norm(q,k for the delta rule). See dragon-gdn doc.
+  ffi_func :tnn_sigmoid,          [:ptr, :ptr],             :ptr
+  ffi_func :tnn_exp,              [:ptr, :ptr],             :ptr
+  ffi_func :tnn_log,              [:ptr, :ptr],             :ptr
+  ffi_func :tnn_neg,              [:ptr, :ptr],             :ptr
+  ffi_func :tnn_sub,              [:ptr, :ptr, :ptr],       :ptr
+  ffi_func :tnn_sqrt,             [:ptr, :ptr],             :ptr
+  ffi_func :tnn_repeat,           [:ptr, :ptr, :ptr],       :ptr
+  ffi_func :tnn_div,              [:ptr, :ptr, :ptr],       :ptr
+  ffi_func :tnn_l2_norm,          [:ptr, :ptr, :double],    :ptr
+  ffi_func :tnn_softplus,         [:ptr, :ptr],             :ptr
+  ffi_func :tnn_scale_bias,       [:ptr, :ptr, :double, :double], :ptr
   ffi_func :tnn_rms_norm,         [:ptr, :ptr, :ptr, :double], :ptr
   ffi_func :tnn_softmax,          [:ptr, :ptr],             :ptr
   ffi_func :tnn_diag_mask_inf,    [:ptr, :ptr, :int],       :ptr
@@ -481,6 +499,7 @@ module TinyNN
   ffi_func :tnn_input_3d_persistent_mmap, [:ptr, :int, :int, :int, :int, :size_t], :ptr
   ffi_func :tnn_input_1d_persistent_mmap, [:ptr, :int, :int, :size_t], :ptr
   ffi_func :tnn_input_1d_f32_persistent, [:ptr, :int],         :ptr
+  ffi_func :tnn_input_1d_i32_persistent, [:ptr, :int],         :ptr
   ffi_func :tnn_finalize_weights, [:ptr],                   :int
   ffi_func :tnn_zero_tensor,      [:ptr, :ptr],             :int
   ffi_func :tnn_realize_b,        [:ptr, :ptr],             :int

data/lib/toy/ffi/tinynn_cuda.rb CHANGED

@@ -238,6 +238,13 @@ module TinyNNCuda
   ffi_func :tnn_input_2d_persistent_typed, [:ptr, :int, :int, :int], :ptr
   ffi_func :tnn_row_size,                  [:int, :int],              :long
   ffi_func :tnn_input_1d_f32_persistent, [:ptr, :int],         :ptr
+  # #1449 fix — the token-id index leaf allocated galloc-external in ctx_w (so
+  # galloc can't free its slot + reuse it for the loss output). Mirrors the CPU
+  # tinynn.rb decl; the C function lives in the shared tinynn_ggml.c (the CUDA
+  # binaries link libtinynn_ggml.a too). Without this, the mirrored CUDA engine's
+  # finalize call to tnn_input_1d_i32_persistent is an undefined method → CUDA
+  # training aborts (caught by the heavy CUDA bench, 2026-06-22).
+  ffi_func :tnn_input_1d_i32_persistent, [:ptr, :int],         :ptr
   # Phase 2 BYO-pointer mmap (CUDA path: ggml-cuda patched to expose
   # ggml_backend_cuda_buffer_from_ptr; weight tensors reference
   # cudaHostRegister'd pages and run via UVA on unified-memory SKUs).

data/lib/toy/ffi/tinynn_metal.rb CHANGED

@@ -225,6 +225,11 @@ module TinyNNMetal
   ffi_func :tnn_input_2d_persistent_typed, [:ptr, :int, :int, :int], :ptr
   ffi_func :tnn_row_size,                  [:int, :int],              :long
   ffi_func :tnn_input_1d_f32_persistent, [:ptr, :int],         :ptr
+  # #1449 fix — galloc-external token-id index leaf (ctx_w). Mirrors the CPU/CUDA
+  # decl; C function lives in the shared tinynn_ggml.c. Without it the mirrored
+  # Metal engine's finalize aborts (undefined method), same as the CUDA gap the
+  # heavy bench caught 2026-06-22. (Metal is Mac-only; not runnable here.)
+  ffi_func :tnn_input_1d_i32_persistent, [:ptr, :int],         :ptr
   # Phase 2 BYO-pointer mmap. On Metal the buffer-from-ptr path falls
   # through to ggml_backend_cpu_buffer_from_ptr (no public Metal
   # buffer_from_ptr API); the scheduler then copies host pages to

data/lib/toy/llm/archs/layer_spec.rb ADDED

@@ -0,0 +1,39 @@
+# lib/toy/llm/archs/layer_spec.rb — Phase 3 of the Dragon-GDN arc: the
+# per-layer descriptor that lets one arch forward loop build a heterogeneous
+# layer stack via flat-int kind dispatch (no polymorphic receiver). See
+# docs/roadmap/dragon-gdn-arch-2026-06-20.md.
+module Toy; module LLM; module Archs
+  # Per-layer descriptor — the seam that lets ONE arch forward loop build a
+  # heterogeneous layer stack (homogeneous Llama attention today; Dragon's
+  # Gated-DeltaNet + selective-attention mix from Phase 5) WITHOUT polymorphic
+  # method dispatch.
+  #
+  # The `kind` field is a FLAT INTEGER, deliberately not a class, symbol, or
+  # block object. The arch loop branches on `spec.kind == KIND_*` and then
+  # calls a CONCRETE typed block method inside each branch, so every
+  # `.build_forward` call site keeps a single receiver class. Funnelling
+  # heterogeneous receiver types through ONE call site is the Spinel
+  # poly-dispatch landmine (the #11/#12 family, matz/spinel#1043) the whole
+  # Dragon seam is shaped to avoid — see dragon-gdn-arch-2026-06-20.md
+  # "Phase 3 — the per-layer descriptor seam."
+  #
+  # Hand-written positional class, NEVER Struct.new (landmine #16 / #1043): a
+  # Struct's synthesized accessors unify across modules and miscompile
+  # unrelated callers, exactly like LlamaArchForwardOut / TransformerBlockCtx.
+  # Carries values, no behavior.
+  class LayerSpec
+    # Layer kinds. Flat ints so the dispatch branch stays monomorphic. The
+    # Phase-3 refactor gate only exercises KIND_ATTENTION (every layer); the
+    # GDN kind is reserved here so the seam shape is fixed before Phase 5
+    # actually wires a Gated-DeltaNet block into a branch.
+    KIND_ATTENTION = 0   # standard Llama-style attention + SwiGLU FFN
+    KIND_GDN       = 1   # Dragon Gated-DeltaNet block (Phase 5)
+    attr_accessor :kind
+    def initialize(kind)
+      @kind = kind
+    end
+  end
+end; end; end
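
The flat-int dispatch described in the comments above can be sketched in isolation. The example assumes two parallel, single-class block arrays and one integer kind per layer. The loop branches on the integer and makes a concrete call in each arm, so no call site ever sees more than one receiver class. `AttentionBlockSketch`, `GdnBlockSketch`, and `build_stack` are hypothetical names, and the string "graph" is only a stand-in for real tensor building.

```ruby
KIND_ATTENTION = 0
KIND_GDN = 1

# Hypothetical block classes; each has its own concrete build method with its
# own argument list, as the two dispatch arms would in a real arch loop.
class AttentionBlockSketch
  def build_forward(graph)
    graph + ["attn"]
  end
end

class GdnBlockSketch
  def build_forward(graph, seq_t)
    graph + ["gdn(t=#{seq_t})"]
  end
end

# One loop, one integer comparison per layer, one receiver class per call site.
def build_stack(kinds, attn_blocks, gdn_blocks, seq_t)
  graph = []
  i = 0
  while i < kinds.length
    kind = kinds[i]
    if kind == KIND_ATTENTION
      graph = attn_blocks[i].build_forward(graph)
    elsif kind == KIND_GDN
      graph = gdn_blocks[i].build_forward(graph, seq_t)
    else
      # Fail loudly instead of silently building the wrong stack.
      raise "unsupported layer kind #{kind} at layer #{i}"
    end
    i += 1
  end
  graph
end

kinds = [KIND_ATTENTION, KIND_ATTENTION]
kinds[1] = KIND_GDN # mark one layer as GDN by mutating a plain int slot
attn = Array.new(2) { AttentionBlockSketch.new }
gdn = Array.new(2) { GdnBlockSketch.new }
p build_stack(kinds, attn, gdn, 8)
```

Both block arrays are fully populated even though each layer uses only one of them; the unused slots act as placeholders so neither array ever mixes object types.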

data/lib/toy/llm/archs/llama_arch.rb CHANGED

@@ -67,6 +67,21 @@ module Toy; module LLM; module Archs
   class LlamaArch
     attr_accessor :t_seq_token_embed, :t_seq_final_norm_gamma, :t_seq_output,
                   :t_seq_w_proj, :seq_blocks_ffi,
+                  # Phase 3 — per-layer descriptor array, parallel to
+                  # seq_blocks_ffi (same length == n_layers).
+                  :seq_layer_specs,
+                  # Phase 5 — the dispatch key is a plain INT array (one kind per
+                  # layer), NOT LayerSpec.kind reads: constructing/mutating
+                  # LayerSpec objects on a realize path trips a Spinel codegen
+                  # miscompile (corrupts the token-id finalize). Mutating a plain
+                  # int array element is proven-safe. build_forward dispatches on
+                  # this; LayerSpec stays the descriptor type/constants home.
+                  :seq_layer_kinds,
+                  # Phase 5 — parallel GDN-block array (same length; entry is a
+                  # GDNBlock at KIND_GDN positions, null elsewhere). The KIND_GDN
+                  # dispatch arm calls into THIS array — a concrete typed call,
+                  # so the seam stays monomorphic per call site.
+                  :seq_gdn_blocks_ffi,
                   # Orchestration-gating carriers — bare cache ivars with
                   # no accessor before P2.5. The lens-branch guard reads
                   # seq_donor_d_in; the shared ctx reads seq_rope_cfg.
@@ -81,6 +96,15 @@ module Toy; module LLM; module Archs
       @t_seq_w_proj           = TinyNN.tnn_null_ptr
       # Seed with one block — matches the former cache init (L112).
       @seq_blocks_ffi         = [Toy::LLM::Blocks::TransformerBlock.new]
+      # Phase 3 — parallel seed: one attention spec for the seed block.
+      @seq_layer_specs        = [Toy::LLM::Archs::LayerSpec.new(Toy::LLM::Archs::LayerSpec::KIND_ATTENTION)]
+      # Phase 5 — parallel int dispatch keys (KIND_ATTENTION for the seed).
+      @seq_layer_kinds        = [Toy::LLM::Archs::LayerSpec::KIND_ATTENTION]
+      # Phase 5 — parallel GDN-block slots. Seeded with GDNBlock placeholders so
+      # the array is MONOMORPHIC (all GDNBlock) — the seam's KIND_GDN call site
+      # never sees a mixed null/object array (Spinel poly-array landmine). At
+      # KIND_ATTENTION layers the placeholder is simply never invoked.
+      @seq_gdn_blocks_ffi     = [Toy::LLM::Blocks::GDNBlock.new]
       @seq_donor_d_in         = 0
       # The cache overwrites seq_rope_cfg with the real RoPE::Cfg before
       # build_forward runs (each realize prologue rebuilds it).
@@ -97,11 +121,33 @@ module Toy; module LLM; module Archs
     # already constructs TransformerBlock.new there, so no new class /
     # Struct / FFI :str at class load. Each realize path now calls this
     # via the cache's seq_blocks_ffi delegator chain (self.seq_arch).
+    # Phase 5 hybrid — rebuild the per-layer spec array from a per-layer GDN
+    # bool flag, using the LayerSpec CTOR (never the .kind= setter: mutating
+    # LayerSpec.kind elsewhere while build_forward reads it trips a Spinel
+    # codegen miscompile that corrupts the token-id finalize). Called after
+    # seed_blocks!, before alloc.
+    # Mark ONE layer as GDN. Takes an INT index (never an array param — a
+    # function-parameter array trips the Spinel #688 type-lock landmine, which
+    # here manifests as a token-id-finalize codegen miscompile). Mutates the
+    # plain int dispatch array element (proven-safe).
+    def set_gdn_layer!(idx)
+      @seq_layer_kinds[idx] = Toy::LLM::Archs::LayerSpec::KIND_GDN
+    end
     def seed_blocks!(n_layers)
       @seq_blocks_ffi = [Toy::LLM::Blocks::TransformerBlock.new]
+      # Phase 3 — seed the parallel spec array in lockstep. Every layer is
+      # KIND_ATTENTION for now (the homogeneous-Llama refactor gate); Phase 5
+      # overwrites individual entries with KIND_GDN for Dragon's pattern.
+      @seq_layer_specs = [Toy::LLM::Archs::LayerSpec.new(Toy::LLM::Archs::LayerSpec::KIND_ATTENTION)]
+      @seq_gdn_blocks_ffi = [Toy::LLM::Blocks::GDNBlock.new]
+      @seq_layer_kinds = [Toy::LLM::Archs::LayerSpec::KIND_ATTENTION]
       li_init = 1
       while li_init < n_layers
         @seq_blocks_ffi.push(Toy::LLM::Blocks::TransformerBlock.new)
+        @seq_layer_specs.push(Toy::LLM::Archs::LayerSpec.new(Toy::LLM::Archs::LayerSpec::KIND_ATTENTION))
+        @seq_gdn_blocks_ffi.push(Toy::LLM::Blocks::GDNBlock.new)
+        @seq_layer_kinds.push(Toy::LLM::Archs::LayerSpec::KIND_ATTENTION)
         li_init = li_init + 1
       end
     end
@@ -213,7 +259,22 @@ module Toy; module LLM; module Archs
       end
       li_g = 0
       while li_g < seq_n_layers
-        t_cur = self.seq_blocks_ffi[li_g].build_forward(sess, t_cur, ctx)
+        # Phase 3 — per-layer descriptor dispatch. The branch compares a FLAT
+        # INT (spec.kind) and each arm calls a CONCRETE typed block method, so
+        # every .build_forward call site stays monomorphic (one receiver
+        # class). KIND_ATTENTION is the only arm wired today; KIND_GDN gets its
+        # own arm + its own typed block array in Phase 5. Unknown kinds fail
+        # loud rather than silently building the wrong graph (never-mask rule).
+        spec_kind = self.seq_layer_kinds[li_g]
+        if spec_kind == Toy::LLM::Archs::LayerSpec::KIND_ATTENTION
+          t_cur = self.seq_blocks_ffi[li_g].build_forward(sess, t_cur, ctx)
+        elsif spec_kind == Toy::LLM::Archs::LayerSpec::KIND_GDN
+          # Concrete typed call into the parallel GDN array — the GDN block reads
+          # its own dims (set at alloc); seq_t/eps come from the shared ctx.
+          t_cur = self.seq_gdn_blocks_ffi[li_g].build_forward(sess, t_cur, seq_t, eps)
+        else
+          raise "LlamaArch#build_forward: unsupported layer kind #{spec_kind} at layer #{li_g}"
+        end
         li_g = li_g + 1
       end