PyPI - pycograd - Versions diffs - 0.0.1__tar.gz - Mend

pycograd 0.0.1__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (57) hide show

pycograd-0.0.1/MANIFEST.in +2 -0
pycograd-0.0.1/PKG-INFO +324 -0
pycograd-0.0.1/README.md +263 -0
pycograd-0.0.1/docs/HISTORY.rst +117 -0
pycograd-0.0.1/docs/LICENSE.txt +11 -0
pycograd-0.0.1/pycograd/__init__.py +390 -0
pycograd-0.0.1/pycograd/_constraints.py +148 -0
pycograd-0.0.1/pycograd/_dims.py +453 -0
pycograd-0.0.1/pycograd/_typing.py +97 -0
pycograd-0.0.1/pycograd/_version.py +669 -0
pycograd-0.0.1/pycograd/ad_graph.py +372 -0
pycograd-0.0.1/pycograd/backends/__init__.py +302 -0
pycograd-0.0.1/pycograd/backends/abstract_backend.py +73 -0
pycograd-0.0.1/pycograd/backends/cupy_backend.py +47 -0
pycograd-0.0.1/pycograd/backends/jax_backend.py +207 -0
pycograd-0.0.1/pycograd/backends/mps_backend.py +47 -0
pycograd-0.0.1/pycograd/backends/numpy_backend.py +66 -0
pycograd-0.0.1/pycograd/backends/tf_backend.py +407 -0
pycograd-0.0.1/pycograd/backends/torch_backend.py +482 -0
pycograd-0.0.1/pycograd/batching.py +638 -0
pycograd-0.0.1/pycograd/capture.py +527 -0
pycograd-0.0.1/pycograd/checkpoint.py +420 -0
pycograd-0.0.1/pycograd/compile.py +199 -0
pycograd-0.0.1/pycograd/cost.py +548 -0
pycograd-0.0.1/pycograd/data.py +115 -0
pycograd-0.0.1/pycograd/dtypes.py +152 -0
pycograd-0.0.1/pycograd/examples/__init__.py +12 -0
pycograd-0.0.1/pycograd/examples/__main__.py +242 -0
pycograd-0.0.1/pycograd/examples/models.py +953 -0
pycograd-0.0.1/pycograd/export.py +121 -0
pycograd-0.0.1/pycograd/extension.py +137 -0
pycograd-0.0.1/pycograd/forward.py +683 -0
pycograd-0.0.1/pycograd/functional.py +808 -0
pycograd-0.0.1/pycograd/ops.py +1575 -0
pycograd-0.0.1/pycograd/optimizers.py +284 -0
pycograd-0.0.1/pycograd/params.py +882 -0
pycograd-0.0.1/pycograd/passes.py +580 -0
pycograd-0.0.1/pycograd/random.py +92 -0
pycograd-0.0.1/pycograd/remat.py +779 -0
pycograd-0.0.1/pycograd/shapes.py +1174 -0
pycograd-0.0.1/pycograd/tensor.py +650 -0
pycograd-0.0.1/pycograd/trace.py +420 -0
pycograd-0.0.1/pycograd/tracer.py +531 -0
pycograd-0.0.1/pycograd/training.py +136 -0
pycograd-0.0.1/pycograd/transforms.py +1078 -0
pycograd-0.0.1/pycograd/transpose.py +167 -0
pycograd-0.0.1/pycograd/tree.py +109 -0
pycograd-0.0.1/pycograd/version.py +18 -0
pycograd-0.0.1/pycograd.egg-info/PKG-INFO +324 -0
pycograd-0.0.1/pycograd.egg-info/SOURCES.txt +56 -0
pycograd-0.0.1/pycograd.egg-info/dependency_links.txt +1 -0
pycograd-0.0.1/pycograd.egg-info/not-zip-safe +1 -0
pycograd-0.0.1/pycograd.egg-info/requires.txt +44 -0
pycograd-0.0.1/pycograd.egg-info/top_level.txt +1 -0
pycograd-0.0.1/pyproject.toml +42 -0
pycograd-0.0.1/setup.cfg +95 -0
pycograd-0.0.1/setup.py +6 -0

pycograd-0.0.1/MANIFEST.in ADDED Viewed

	@@ -0,0 +1,2 @@
1	+ include README.md docs/HISTORY.rst docs/LICENSE.txt
2	+ recursive-exclude test *

pycograd-0.0.1/PKG-INFO ADDED Viewed

@@ -0,0 +1,324 @@
+Metadata-Version: 2.4
+Name: pycograd
+Version: 0.0.1
+Summary: A small, readable reverse-mode autograd library built on numpy and pyccolo
+Home-page: https://github.com/smacke/pycograd
+Author: Stephen Macke
+Author-email: stephen.macke@gmail.com
+License: BSD-3-Clause
+Classifier: Development Status :: 3 - Alpha
+Classifier: Intended Audience :: Developers
+Classifier: Intended Audience :: Science/Research
+Classifier: License :: OSI Approved :: BSD License
+Classifier: Natural Language :: English
+Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
+Classifier: Programming Language :: Python :: 3.10
+Classifier: Programming Language :: Python :: 3.11
+Classifier: Programming Language :: Python :: 3.12
+Classifier: Programming Language :: Python :: 3.13
+Classifier: Programming Language :: Python :: 3.14
+Requires-Python: >=3.10
+Description-Content-Type: text/markdown; charset=UTF-8
+License-File: docs/LICENSE.txt
+Requires-Dist: ml_dtypes
+Requires-Dist: numpy
+Requires-Dist: pipescript>=0.0.22
+Requires-Dist: pyccolo>=0.0.87
+Provides-Extra: test
+Requires-Dist: black<24; extra == "test"
+Requires-Dist: ipython; extra == "test"
+Requires-Dist: isort; extra == "test"
+Requires-Dist: mypy; extra == "test"
+Requires-Dist: pytest; extra == "test"
+Requires-Dist: pytest-cov; extra == "test"
+Requires-Dist: ruff; extra == "test"
+Provides-Extra: jax
+Requires-Dist: jax; extra == "jax"
+Provides-Extra: torch
+Requires-Dist: torch; extra == "torch"
+Provides-Extra: tf
+Requires-Dist: tensorflow; extra == "tf"
+Provides-Extra: cupy
+Requires-Dist: cupy; extra == "cupy"
+Provides-Extra: onnx
+Requires-Dist: torch; extra == "onnx"
+Requires-Dist: onnx; extra == "onnx"
+Requires-Dist: onnxruntime; extra == "onnx"
+Provides-Extra: dev
+Requires-Dist: build; extra == "dev"
+Requires-Dist: pycln; extra == "dev"
+Requires-Dist: twine; extra == "dev"
+Requires-Dist: setuptools-git-versioning; extra == "dev"
+Requires-Dist: versioneer; extra == "dev"
+Requires-Dist: black<24; extra == "dev"
+Requires-Dist: ipython; extra == "dev"
+Requires-Dist: isort; extra == "dev"
+Requires-Dist: mypy; extra == "dev"
+Requires-Dist: pytest; extra == "dev"
+Requires-Dist: pytest-cov; extra == "dev"
+Requires-Dist: ruff; extra == "dev"
+Dynamic: license-file
+# pycograd
+[![pycograd](https://github.com/smacke/pycograd/actions/workflows/ci.yml/badge.svg)](https://github.com/smacke/pycograd/actions/workflows/ci.yml)
+[![checked with mypy](https://www.mypy-lang.org/static/mypy_badge.svg)](https://mypy-lang.org/)
+[![License: BSD3](https://img.shields.io/badge/License-BSD3-maroon.svg)](https://opensource.org/licenses/BSD-3-Clause)
+[![Python versions](https://img.shields.io/pypi/pyversions/pycograd.svg)](https://pypi.org/project/pycograd)
+[![PyPI version](https://img.shields.io/pypi/v/pycograd.svg)](https://pypi.org/project/pycograd)
+A small, readable reverse-mode automatic-differentiation library, built on numpy
+and [pyccolo](https://github.com/smacke/pyccolo). Write *ordinary* numeric Python
+— including `numpy` calls like `np.exp`, `np.dot`, `np.sum` and operators like
+`@` — and get correct gradients, with **no special "autodiff namespace."**
+It's small enough to read end to end — the kind of autograd you can use to
+explain backprop on one slide — but the machinery around the core scales *up*:
+auto-batching (`vmap`), forward-mode (`jvp`) and Hessians, graph capture and
+optimization, gradient checkpointing, and a one-call **compile to PyTorch / JAX /
+TensorFlow** — enough to write a Transformer or an RWKV recurrent net (see
+[`notebooks/`](notebooks/)) and have one forward pass serve all of them.
+There are **two co-equal ways to write a model**, and the same transforms work on
+both:
+- a **functional** surface — plain numpy functions you hand to `grad` /
+  `value_and_grad` / `vmap`;
+- an **ambient-weights DSL** — a `params{ ... } as weights:` block plus `|>`
+  pipelines, so you write the forward *once*, by bare name, and `weights.grad`
+  differentiates it.
+## Install
+```bash
+pip install pycograd
+```
+## Quickstart
+### Functional — `grad` over ordinary numpy
+Hand any numpy function to `grad` / `value_and_grad`; the array argument is lifted
+onto the tape for you.
+```python
+import numpy as np
+from pycograd import value_and_grad
+def f(x):
+    return np.sum(np.sin(x * x))          # ordinary numpy -- and it differentiates
+x = np.array([0.5, 1.0, 1.5])
+value, (g,) = value_and_grad(f)(x)
+# g == 2 * x * cos(x * x)
+```
+### The ambient-weights DSL — write the forward once
+In a notebook or IPython session, `%load_ext pycograd` turns on the DSL: a
+`params{ ... } as weights:` block builds a parameter pytree and injects the
+weights as ambient names, and `|>` pipelines differentiate when a model runs
+through them. Here is a 2-layer MLP classifier trained by SGD:
+```python
+%load_ext pycograd
+import numpy as np
+from pycograd import relu, softmax, cross_entropy
+rng = np.random.default_rng(42)
+X, Y = ...                                  # features and one-hot labels (3 classes)
+with params{
+    w1 = 0.3 * rng.standard_normal((2, 16)); b1 = np.zeros(16)
+    w2 = 0.3 * rng.standard_normal((16, 3)); b2 = np.zeros(3)
+} as weights:
+    logits  = $ |> $ @ w1 + b1 |> relu |> $ @ w2 + b2     # the model, written once
+    forward = $ |> logits |> softmax
+    obj     = |> X |> logits |> cross_entropy($, Y)        # mean softmax cross-entropy
+    for _ in range(10):
+        value, grads = weights.grad(obj)                   # bind weights -> Vars, backprop
+        weights.step(grads, 0.5)                           # in-place SGD
+```
+Weights are referenced by bare name; `relu` / `softmax` / `cross_entropy` are
+first-class, finite-difference-checked fused ops imported straight from
+`pycograd` (there is no op library to import for the *model* — a linear layer is
+just `$ @ w + b`). `frozen[...]` holds a weight fixed (its gradient comes back
+`None`), `tied(...)` shares one. `weights.grad` only *computes* gradients, so any
+optimizer can consume them — swap the loop for `train(weights, obj, 200,
+Adam(lr=cosine_decay(0.05, 200)))`. The very same `forward` is what `vmap` and
+`compile` consume below.
+## One forward, many uses
+The payoff of writing the forward once is that the autodiff transforms compose
+over it with no rewrites.
+### Per-sample gradients with `vmap`
+`vmap` turns a function written for **one** example into one that runs over a
+whole batch in a single vectorized pass. Composed with `grad`, it gives something
+broadcasting *cannot*: the gradient of each example separately, stacked over the
+batch.
+```python
+from pycograd import grad, vmap
+def per_example_loss(w, b, x, y):           # ONE (2,) point + ONE label -> scalar
+    return x |> $ @ w + b |> cross_entropy($, y)
+w = np.zeros((2, 3)); b = np.zeros(3)
+gw, gb, _, _ = vmap(grad(per_example_loss), in_axes=(None, None, 0, 0))(w, b, X, Y)
+# gw: (N, 2, 3)   gb: (N, 3)   -- one gradient per example
+# their batch-mean is exactly the ordinary full-batch gradient
+```
+Per-sample gradients are exactly what gradient clipping and DP-SGD need: bound
+each example's gradient norm *before* averaging. `vmap` is one trace level in the
+interpreter stack, so it composes every which way — `vmap(grad(f))` gives the
+per-sample gradients above, `grad(vmap(f))` runs straight through a batched
+forward, and `vmap(vmap(f))` nests.
+### Compile to PyTorch / JAX / TensorFlow
+The same forward can be handed to *another framework's* autodiff. Pass
+`backend=` to `weights.grad` (or `train`) and gradients come back from torch /
+jax / tf instead of pycograd's numpy tape — matching to floating-point tolerance:
+```python
+v_np, g_np = weights.grad(objective)                         # pycograd's numpy tape
+for backend in ("torch", "jax", "tf"):
+    v_be, g_be = weights.grad(objective, backend=backend, jit=True)
+    worst = max(np.max(np.abs(np.asarray(g_be[k]) - np.asarray(g_np[k]))) for k in weights)
+    print("%-5s  max|grad - grad_numpy| = %.1e" % (backend, worst))   # ~1e-6
+```
+`compile_to(forward, "torch")` instead returns a function over the framework's
+own tensors, and `weights.to_torch_module(forward)` / `export_torchscript` /
+`export_onnx` package a trained net for shipping with no pycograd dependency. A
+GRU, an attention block, or an RWKV cell written once thus trains on numpy,
+batches under `vmap`, and compiles to three frameworks with zero rewrites
+(see the notebooks below).
+## Shape inference
+Because a net is just a numpy function, you can ask what shapes it produces
+*without* training it. `eval_shape` runs the function over abstract `(shape,
+dtype)` values — no data, no allocation, so a `100000×100000` matmul is sized
+instantly — and `summary` tabulates the parameters:
+```python
+from pycograd import eval_shape, summary, ShapeDtypeStruct as S
+eval_shape(mlp_forward, S((5, 2)), S((2, 16)), S((16,)), S((16, 3)), S((3,)))  # -> f64[5,3]
+summary(mlp_batch_loss, params, (5, 2), (5, 3))            # per-weight table + total params
+```
+Shape mismatches raise a `ShapeError` that names the op and operand shapes
+(`matmul: incompatible shapes (3, 4) and (5, 6)`) instead of an opaque numpy
+message; a shape that genuinely depends on data values is reported as such rather
+than silently mis-sized.
+## Gradient checkpointing
+The tape keeps every intermediate alive until `backward`, so a deep net can run
+out of memory. `checkpoint(f)` wraps a segment so its activations are **dropped on
+the forward and recomputed in backward** — trading ~one extra forward pass for a
+large peak-memory drop. It's a drop-in: gradients are unchanged.
+```python
+from pycograd import checkpoint, value_and_grad
+def loss(x):
+    y = checkpoint(block)(x)               # block's activations are rematerialized in backward
+    return np.sum(y * y)
+value, (g,) = value_and_grad(loss)(x)      # same gradient, less memory
+```
+It composes with positional `grad` / `value_and_grad`, the ambient
+`weights.grad` loop, and `vmap` (`vmap(checkpoint(f)) == checkpoint(vmap(f))`);
+`f` must be deterministic in its inputs/weights, since it is re-run to recover the
+activations.
+## Devices / backends
+The tape runs on a pluggable **array backend** (NumPy by default). A `device(...)`
+block swaps the array library the primitives, the tape, and the optimizers compute
+with — so the same net trains on a GPU with no code changes, gradients and
+optimizer state living on-device:
+```python
+from pycograd import device, value_and_grad, Adam
+with device("cupy"):                       # requires a CUDA GPU + cupy (`pip install pycograd[cupy]`)
+    value, (g,) = value_and_grad(loss)(w)  # tape + grads on the GPU
+    w = Adam(lr=1e-3).step(w, g)           # Adam moments on the GPU too
+```
+CuPy mirrors NumPy's API, so the `np.exp` / `@` / `np.sum` code you already wrote
+"just works." For finer control, `on_cpu[...]` / `on_device(...)` pin individual
+leaves — e.g. a large embedding table on the CPU while the classifier trains on
+the GPU, one autograd graph straddling both (see the device-placement notebook).
+This is distinct from `compile_to`, which hands the net to *another framework's*
+autodiff.
+## Graph capture & rematerialization
+`capture(forward, x)` records a `|>` pipeline into a flat SSA graph you can print,
+`grad_graph` differentiates it, and `optimize` runs passes over it (CSE across the
+forward/backward boundary, dead-code elimination). On top of that, `plan_remat`
+fits a value+gradient pass under a memory budget by deciding per activation
+whether to keep, spill, or recompute it — `eval_scheduled` then runs the plan to
+the identical answer. See the [graph-viz](notebooks/pycograd_graph_viz_demo.ipynb)
+and [remat](notebooks/pycograd_remat_demo.ipynb) notebooks.
+## Examples & notebooks
+The bundled demos (logistic regression, MLP, LayerNorm/Dropout, single-head
+Transformer block, GRU/LSTM) train from scratch and are gradient-checked against
+finite differences. Run them with:
+```bash
+python -m pycograd.examples
+```
+The [`notebooks/`](notebooks/) directory walks through the library end to end:
+- [`pycograd_demo`](notebooks/pycograd_demo.ipynb) — the DSL tour: linear
+  classifier → MLP → highway net → self-attention → a Transformer encoder block.
+- [`pycograd_vmap_demo`](notebooks/pycograd_vmap_demo.ipynb) — where `vmap`
+  earns its keep: per-sample gradients, gradient clipping, batched attention.
+- [`pycograd_rnn_demo`](notebooks/pycograd_rnn_demo.ipynb) /
+  [`pycograd_rwkv_demo`](notebooks/pycograd_rwkv_demo.ipynb) — GRU/LSTM and
+  RWKV (trained in parallel, sampled in O(1)-per-token recurrent form).
+- [`pycograd_compile_*`](notebooks/) — compile/parity against PyTorch, JAX,
+  TensorFlow, and Apple MPS, plus TorchScript/ONNX export.
+- [`pycograd_device_placement_demo`](notebooks/pycograd_device_placement_demo.ipynb) —
+  a single pass split across CPU and GPU.
+- [`pycograd_graph_viz_demo`](notebooks/pycograd_graph_viz_demo.ipynb) /
+  [`pycograd_remat_demo`](notebooks/pycograd_remat_demo.ipynb) — the graph IR,
+  optimization passes, and the cost-model-driven spill/remat planner.
+`value_and_grad` / `grad` work the same in a notebook as anywhere else; the DSL is
+the only part that needs `%load_ext pycograd`.
+## How it works
+* `Var` is a reverse-mode tape node wrapping a numpy array. Arithmetic operators
+  are overloaded so that running a program builds a computation graph;
+  `Var.backward()` then walks it in reverse to accumulate gradients.
+* Operator overloading alone is *not enough*. The moment user code calls a numpy
+  function — `np.exp(x)` — numpy's ufunc machinery takes over and the gradient
+  link is lost. (`Var` sets `__array_ufunc__ = None` so this fails loudly instead
+  of silently producing a wrong gradient.) pyccolo supplies the missing piece: its
+  `before_call` event lets a handler *replace the function being called*, swapping
+  `np.exp` for a differentiable `d_exp` transparently — so idiomatic numpy code
+  "just differentiates." The same trick routes scalar `math.*` through the
+  numpy-backed primitives, differentiates through your own helper functions by
+  instrumenting them on demand, and powers the `|>` DSL.
+## License
+[BSD-3-Clause](docs/LICENSE.txt).

pycograd-0.0.1/README.md ADDED Viewed

@@ -0,0 +1,263 @@
+# pycograd
+[![pycograd](https://github.com/smacke/pycograd/actions/workflows/ci.yml/badge.svg)](https://github.com/smacke/pycograd/actions/workflows/ci.yml)
+[![checked with mypy](https://www.mypy-lang.org/static/mypy_badge.svg)](https://mypy-lang.org/)
+[![License: BSD3](https://img.shields.io/badge/License-BSD3-maroon.svg)](https://opensource.org/licenses/BSD-3-Clause)
+[![Python versions](https://img.shields.io/pypi/pyversions/pycograd.svg)](https://pypi.org/project/pycograd)
+[![PyPI version](https://img.shields.io/pypi/v/pycograd.svg)](https://pypi.org/project/pycograd)
+A small, readable reverse-mode automatic-differentiation library, built on numpy
+and [pyccolo](https://github.com/smacke/pyccolo). Write *ordinary* numeric Python
+— including `numpy` calls like `np.exp`, `np.dot`, `np.sum` and operators like
+`@` — and get correct gradients, with **no special "autodiff namespace."**
+It's small enough to read end to end — the kind of autograd you can use to
+explain backprop on one slide — but the machinery around the core scales *up*:
+auto-batching (`vmap`), forward-mode (`jvp`) and Hessians, graph capture and
+optimization, gradient checkpointing, and a one-call **compile to PyTorch / JAX /
+TensorFlow** — enough to write a Transformer or an RWKV recurrent net (see
+[`notebooks/`](notebooks/)) and have one forward pass serve all of them.
+There are **two co-equal ways to write a model**, and the same transforms work on
+both:
+- a **functional** surface — plain numpy functions you hand to `grad` /
+  `value_and_grad` / `vmap`;
+- an **ambient-weights DSL** — a `params{ ... } as weights:` block plus `|>`
+  pipelines, so you write the forward *once*, by bare name, and `weights.grad`
+  differentiates it.
+## Install
+```bash
+pip install pycograd
+```
+## Quickstart
+### Functional — `grad` over ordinary numpy
+Hand any numpy function to `grad` / `value_and_grad`; the array argument is lifted
+onto the tape for you.
+```python
+import numpy as np
+from pycograd import value_and_grad
+def f(x):
+    return np.sum(np.sin(x * x))          # ordinary numpy -- and it differentiates
+x = np.array([0.5, 1.0, 1.5])
+value, (g,) = value_and_grad(f)(x)
+# g == 2 * x * cos(x * x)
+```
+### The ambient-weights DSL — write the forward once
+In a notebook or IPython session, `%load_ext pycograd` turns on the DSL: a
+`params{ ... } as weights:` block builds a parameter pytree and injects the
+weights as ambient names, and `|>` pipelines differentiate when a model runs
+through them. Here is a 2-layer MLP classifier trained by SGD:
+```python
+%load_ext pycograd
+import numpy as np
+from pycograd import relu, softmax, cross_entropy
+rng = np.random.default_rng(42)
+X, Y = ...                                  # features and one-hot labels (3 classes)
+with params{
+    w1 = 0.3 * rng.standard_normal((2, 16)); b1 = np.zeros(16)
+    w2 = 0.3 * rng.standard_normal((16, 3)); b2 = np.zeros(3)
+} as weights:
+    logits  = $ |> $ @ w1 + b1 |> relu |> $ @ w2 + b2     # the model, written once
+    forward = $ |> logits |> softmax
+    obj     = |> X |> logits |> cross_entropy($, Y)        # mean softmax cross-entropy
+    for _ in range(10):
+        value, grads = weights.grad(obj)                   # bind weights -> Vars, backprop
+        weights.step(grads, 0.5)                           # in-place SGD
+```
+Weights are referenced by bare name; `relu` / `softmax` / `cross_entropy` are
+first-class, finite-difference-checked fused ops imported straight from
+`pycograd` (there is no op library to import for the *model* — a linear layer is
+just `$ @ w + b`). `frozen[...]` holds a weight fixed (its gradient comes back
+`None`), `tied(...)` shares one. `weights.grad` only *computes* gradients, so any
+optimizer can consume them — swap the loop for `train(weights, obj, 200,
+Adam(lr=cosine_decay(0.05, 200)))`. The very same `forward` is what `vmap` and
+`compile` consume below.
+## One forward, many uses
+The payoff of writing the forward once is that the autodiff transforms compose
+over it with no rewrites.
+### Per-sample gradients with `vmap`
+`vmap` turns a function written for **one** example into one that runs over a
+whole batch in a single vectorized pass. Composed with `grad`, it gives something
+broadcasting *cannot*: the gradient of each example separately, stacked over the
+batch.
+```python
+from pycograd import grad, vmap
+def per_example_loss(w, b, x, y):           # ONE (2,) point + ONE label -> scalar
+    return x |> $ @ w + b |> cross_entropy($, y)
+w = np.zeros((2, 3)); b = np.zeros(3)
+gw, gb, _, _ = vmap(grad(per_example_loss), in_axes=(None, None, 0, 0))(w, b, X, Y)
+# gw: (N, 2, 3)   gb: (N, 3)   -- one gradient per example
+# their batch-mean is exactly the ordinary full-batch gradient
+```
+Per-sample gradients are exactly what gradient clipping and DP-SGD need: bound
+each example's gradient norm *before* averaging. `vmap` is one trace level in the
+interpreter stack, so it composes every which way — `vmap(grad(f))` gives the
+per-sample gradients above, `grad(vmap(f))` runs straight through a batched
+forward, and `vmap(vmap(f))` nests.
+### Compile to PyTorch / JAX / TensorFlow
+The same forward can be handed to *another framework's* autodiff. Pass
+`backend=` to `weights.grad` (or `train`) and gradients come back from torch /
+jax / tf instead of pycograd's numpy tape — matching to floating-point tolerance:
+```python
+v_np, g_np = weights.grad(objective)                         # pycograd's numpy tape
+for backend in ("torch", "jax", "tf"):
+    v_be, g_be = weights.grad(objective, backend=backend, jit=True)
+    worst = max(np.max(np.abs(np.asarray(g_be[k]) - np.asarray(g_np[k]))) for k in weights)
+    print("%-5s  max|grad - grad_numpy| = %.1e" % (backend, worst))   # ~1e-6
+```
+`compile_to(forward, "torch")` instead returns a function over the framework's
+own tensors, and `weights.to_torch_module(forward)` / `export_torchscript` /
+`export_onnx` package a trained net for shipping with no pycograd dependency. A
+GRU, an attention block, or an RWKV cell written once thus trains on numpy,
+batches under `vmap`, and compiles to three frameworks with zero rewrites
+(see the notebooks below).
+## Shape inference
+Because a net is just a numpy function, you can ask what shapes it produces
+*without* training it. `eval_shape` runs the function over abstract `(shape,
+dtype)` values — no data, no allocation, so a `100000×100000` matmul is sized
+instantly — and `summary` tabulates the parameters:
+```python
+from pycograd import eval_shape, summary, ShapeDtypeStruct as S
+eval_shape(mlp_forward, S((5, 2)), S((2, 16)), S((16,)), S((16, 3)), S((3,)))  # -> f64[5,3]
+summary(mlp_batch_loss, params, (5, 2), (5, 3))            # per-weight table + total params
+```
+Shape mismatches raise a `ShapeError` that names the op and operand shapes
+(`matmul: incompatible shapes (3, 4) and (5, 6)`) instead of an opaque numpy
+message; a shape that genuinely depends on data values is reported as such rather
+than silently mis-sized.
+## Gradient checkpointing
+The tape keeps every intermediate alive until `backward`, so a deep net can run
+out of memory. `checkpoint(f)` wraps a segment so its activations are **dropped on
+the forward and recomputed in backward** — trading ~one extra forward pass for a
+large peak-memory drop. It's a drop-in: gradients are unchanged.
+```python
+from pycograd import checkpoint, value_and_grad
+def loss(x):
+    y = checkpoint(block)(x)               # block's activations are rematerialized in backward
+    return np.sum(y * y)
+value, (g,) = value_and_grad(loss)(x)      # same gradient, less memory
+```
+It composes with positional `grad` / `value_and_grad`, the ambient
+`weights.grad` loop, and `vmap` (`vmap(checkpoint(f)) == checkpoint(vmap(f))`);
+`f` must be deterministic in its inputs/weights, since it is re-run to recover the
+activations.
+## Devices / backends
+The tape runs on a pluggable **array backend** (NumPy by default). A `device(...)`
+block swaps the array library the primitives, the tape, and the optimizers compute
+with — so the same net trains on a GPU with no code changes, gradients and
+optimizer state living on-device:
+```python
+from pycograd import device, value_and_grad, Adam
+with device("cupy"):                       # requires a CUDA GPU + cupy (`pip install pycograd[cupy]`)
+    value, (g,) = value_and_grad(loss)(w)  # tape + grads on the GPU
+    w = Adam(lr=1e-3).step(w, g)           # Adam moments on the GPU too
+```
+CuPy mirrors NumPy's API, so the `np.exp` / `@` / `np.sum` code you already wrote
+"just works." For finer control, `on_cpu[...]` / `on_device(...)` pin individual
+leaves — e.g. a large embedding table on the CPU while the classifier trains on
+the GPU, one autograd graph straddling both (see the device-placement notebook).
+This is distinct from `compile_to`, which hands the net to *another framework's*
+autodiff.
+## Graph capture & rematerialization
+`capture(forward, x)` records a `|>` pipeline into a flat SSA graph you can print,
+`grad_graph` differentiates it, and `optimize` runs passes over it (CSE across the
+forward/backward boundary, dead-code elimination). On top of that, `plan_remat`
+fits a value+gradient pass under a memory budget by deciding per activation
+whether to keep, spill, or recompute it — `eval_scheduled` then runs the plan to
+the identical answer. See the [graph-viz](notebooks/pycograd_graph_viz_demo.ipynb)
+and [remat](notebooks/pycograd_remat_demo.ipynb) notebooks.
+## Examples & notebooks
+The bundled demos (logistic regression, MLP, LayerNorm/Dropout, single-head
+Transformer block, GRU/LSTM) train from scratch and are gradient-checked against
+finite differences. Run them with:
+```bash
+python -m pycograd.examples
+```
+The [`notebooks/`](notebooks/) directory walks through the library end to end:
+- [`pycograd_demo`](notebooks/pycograd_demo.ipynb) — the DSL tour: linear
+  classifier → MLP → highway net → self-attention → a Transformer encoder block.
+- [`pycograd_vmap_demo`](notebooks/pycograd_vmap_demo.ipynb) — where `vmap`
+  earns its keep: per-sample gradients, gradient clipping, batched attention.
+- [`pycograd_rnn_demo`](notebooks/pycograd_rnn_demo.ipynb) /
+  [`pycograd_rwkv_demo`](notebooks/pycograd_rwkv_demo.ipynb) — GRU/LSTM and
+  RWKV (trained in parallel, sampled in O(1)-per-token recurrent form).
+- [`pycograd_compile_*`](notebooks/) — compile/parity against PyTorch, JAX,
+  TensorFlow, and Apple MPS, plus TorchScript/ONNX export.
+- [`pycograd_device_placement_demo`](notebooks/pycograd_device_placement_demo.ipynb) —
+  a single pass split across CPU and GPU.
+- [`pycograd_graph_viz_demo`](notebooks/pycograd_graph_viz_demo.ipynb) /
+  [`pycograd_remat_demo`](notebooks/pycograd_remat_demo.ipynb) — the graph IR,
+  optimization passes, and the cost-model-driven spill/remat planner.
+`value_and_grad` / `grad` work the same in a notebook as anywhere else; the DSL is
+the only part that needs `%load_ext pycograd`.
+## How it works
+* `Var` is a reverse-mode tape node wrapping a numpy array. Arithmetic operators
+  are overloaded so that running a program builds a computation graph;
+  `Var.backward()` then walks it in reverse to accumulate gradients.
+* Operator overloading alone is *not enough*. The moment user code calls a numpy
+  function — `np.exp(x)` — numpy's ufunc machinery takes over and the gradient
+  link is lost. (`Var` sets `__array_ufunc__ = None` so this fails loudly instead
+  of silently producing a wrong gradient.) pyccolo supplies the missing piece: its
+  `before_call` event lets a handler *replace the function being called*, swapping
+  `np.exp` for a differentiable `d_exp` transparently — so idiomatic numpy code
+  "just differentiates." The same trick routes scalar `math.*` through the
+  numpy-backed primitives, differentiates through your own helper functions by
+  instrumenting them on demand, and powers the `|>` DSL.
+## License
+[BSD-3-Clause](docs/LICENSE.txt).