PyPI - coreml-diffusion - Versions diffs - 0.1.1__tar.gz → 0.1.3__tar.gz - Mend



Files changed (54)

coreml_diffusion-0.1.3/.github/workflows/release-please.yml ADDED

@@ -0,0 +1,27 @@
+name: Release Please
+# Manages the release cycle: maintains a Release PR that bumps the version in
+# pyproject.toml and curates CHANGELOG.md from Conventional Commits (only the
+# user-facing types in release-please-config.json's changelog-sections are
+# surfaced). Merging that PR tags + publishes a GitHub Release.
+#
+# Runs with GH_CI_PAT (not the default GITHUB_TOKEN) so the Release it creates
+# triggers publish-pypi.yml — events made with GITHUB_TOKEN do not start other
+# workflows.
+on:
+  push:
+    branches:
+      - main
+permissions:
+  contents: write
+  pull-requests: write
+jobs:
+  release-please:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: googleapis/release-please-action@v4
+        with:
+          token: ${{ secrets.GH_CI_PAT }}

coreml_diffusion-0.1.3/.release-please-manifest.json ADDED

@@ -0,0 +1,3 @@
+{
+  ".": "0.1.3"
+}

coreml_diffusion-0.1.3/CHANGELOG.md ADDED

@@ -0,0 +1,26 @@
+# Changelog
+## [0.1.3](https://github.com/aszc-dev/coreml-diffusion/compare/v0.1.2...v0.1.3) (2026-06-04)
+### ✨ Features
+* **convert:** add VAE and CLIP text-encoder conversion ([dc1f85b](https://github.com/aszc-dev/coreml-diffusion/commit/dc1f85bafe50d36655ff7ece0c052a30fd77bb81))
+* **inference:** end-to-end Core ML pipeline (VAE + text-encoder swap) ([ca08b16](https://github.com/aszc-dev/coreml-diffusion/commit/ca08b16729529afbdf610d0e8ec2d09b849080c6))
+### 🐛 Bug Fixes
+* **inference:** expose .device on the Core ML adapters ([30a673e](https://github.com/aszc-dev/coreml-diffusion/commit/30a673eebe3927722214d0ab6a44fbc344d18f3a))
+### 📚 Documentation
+* **readme:** link the log.aszc.dev energy benchmark writeup ([77927b5](https://github.com/aszc-dev/coreml-diffusion/commit/77927b5dd5f1311a3b3c317692f3a347e3976a54))
+## [0.1.2](https://github.com/aszc-dev/coreml-diffusion/compare/v0.1.1...v0.1.2) (2026-05-27)
+### 🐛 Bug Fixes
+* **attention:** convertible fp32 ORIGINAL attention for the Core ML GPU path ([#2](https://github.com/aszc-dev/coreml-diffusion/issues/2)) ([28e56fc](https://github.com/aszc-dev/coreml-diffusion/commit/28e56fcf8c2242ebbe4c05abd05f7e796069d7d1))

{coreml_diffusion-0.1.1 → coreml_diffusion-0.1.3}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: coreml-diffusion
-Version: 0.1.1
+Version: 0.1.3
 Summary: Convert diffusion-model checkpoints (SD1.5/SDXL) to Core ML for Apple Neural Engine — framework-free, ComfyUI-independent.
 Project-URL: Homepage, https://github.com/aszc-dev/coreml-diffusion
 Project-URL: Repository, https://github.com/aszc-dev/coreml-diffusion
@@ -44,6 +44,12 @@ GPU-free, embeddable in a Swift/iOS app. ANE is the differentiator — this is a
 feasibility and power efficiency for SD1.5/SDXL on ANE, not a raw-throughput claim
 against desktop GPUs.
+The power-efficiency claim is measured: in a cross-backend benchmark the ct9
+converter here runs the SD1.5 UNet on the ANE at **6-7x lower energy** than
+GPU/MPS, at the same speed — see the writeup,
+[The ANE runs the SD1.5 UNet at 6-7x lower energy than GPU/MPS](https://log.aszc.dev/ane-vs-gpu-mps-sd15-unet-energy/),
+for the methodology and the numerical-divergence tradeoff.
 The scope is diffusion architectures generally, not Stable Diffusion specifically.
 The project aims to gather, in one place: the conversion path, a reproducible
 benchmarking suite for objective comparison, a per-model catalogue documenting the

{coreml_diffusion-0.1.1 → coreml_diffusion-0.1.3}/README.md RENAMED

@@ -15,6 +15,12 @@ GPU-free, embeddable in a Swift/iOS app. ANE is the differentiator — this is a
 feasibility and power efficiency for SD1.5/SDXL on ANE, not a raw-throughput claim
 against desktop GPUs.
+The power-efficiency claim is measured: in a cross-backend benchmark the ct9
+converter here runs the SD1.5 UNet on the ANE at **6-7x lower energy** than
+GPU/MPS, at the same speed — see the writeup,
+[The ANE runs the SD1.5 UNet at 6-7x lower energy than GPU/MPS](https://log.aszc.dev/ane-vs-gpu-mps-sd15-unet-energy/),
+for the methodology and the numerical-divergence tradeoff.
 The scope is diffusion architectures generally, not Stable Diffusion specifically.
 The project aims to gather, in one place: the conversion path, a reproducible
 benchmarking suite for objective comparison, a per-model catalogue documenting the

{coreml_diffusion-0.1.1 → coreml_diffusion-0.1.3}/coreml_diffusion/__init__.py RENAMED

@@ -23,9 +23,11 @@ because a saved workflow JSON references these strings verbatim.
 from enum import Enum
 from coreml_diffusion.attention import ATTENTION_IMPLEMENTATIONS
+from coreml_diffusion.component import CONVERTIBLE_COMPONENTS
 from coreml_diffusion.model_version import ModelVersion
 from coreml_diffusion.naming import (
     QUANT_NBITS_VALUES,
+    compose_component_name,
     compose_out_name,
     lora_names_from_params,
 )
@@ -36,12 +38,16 @@ __all__ = [
     "list_model_versions",
     "list_attention_impls",
     "list_quant_modes",
+    "list_convertible_components",
     "CONTRACT_VERSION",
     "compose_out_name",
+    "compose_component_name",
     "lora_names_from_params",
     "convert",
     "build_pipeline",
     "CoreMLUNet",
+    "CoreMLVAE",
+    "CoreMLTextEncoder",
 ]
@@ -91,9 +97,20 @@ def list_quant_modes() -> list[str]:
     return list(QUANT_NBITS_VALUES)
+def list_convertible_components() -> list[str]:
+    """Convertible components, e.g. ``["unet", "vae_decoder", ...]``.
+    ``"unet"`` is the historical default; the rest are the VAE / text-encoder
+    extension. ``"text_encoder_2"`` is only meaningful for SDXL — validity per
+    model version is enforced at convert time, not advertised here.
+    """
+    return list(CONVERTIBLE_COMPONENTS)
 # Discovery-contract version. Bump per the additive-only rules in this module's
 # docstring and CONVERTER_EXTRACTION_SPEC.md "Interface contract".
-CONTRACT_VERSION = "1.0"
+# 1.1: added list_convertible_components (VAE + text-encoder conversion).
+CONTRACT_VERSION = "1.1"
 def __getattr__(name):
@@ -107,7 +124,7 @@ def __getattr__(name):
         from coreml_diffusion.convert import convert as _convert
         return _convert
-    if name in ("build_pipeline", "CoreMLUNet"):
+    if name in ("build_pipeline", "CoreMLUNet", "CoreMLVAE", "CoreMLTextEncoder"):
         from coreml_diffusion import inference
         return getattr(inference, name)
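
The `__getattr__` hunk above extends a module-level lazy export (the PEP 562 mechanism), so the inference adapters are imported only on first access. A minimal sketch of that pattern follows. It uses a stand-in module object and a hypothetical `load_inference` loader with placeholder values, not the package's real imports:

```python
import types

# Names served lazily; mirrors the tuple in the hunk above.
LAZY_INFERENCE_NAMES = (
    "build_pipeline",
    "CoreMLUNet",
    "CoreMLVAE",
    "CoreMLTextEncoder",
)

# Records every time the (hypothetical) heavy import runs.
load_calls = []


def load_inference():
    """Hypothetical stand-in for `from coreml_diffusion import inference`."""
    load_calls.append("inference")
    # Each exported name maps to a placeholder string in this sketch.
    return types.SimpleNamespace(**{n: f"<{n}>" for n in LAZY_INFERENCE_NAMES})


def make_lazy_module():
    """Build a module whose heavy attributes resolve on first access."""
    mod = types.ModuleType("lazy_demo")

    def module_getattr(name):
        # Python calls this only when normal attribute lookup fails.
        if name in LAZY_INFERENCE_NAMES:
            return getattr(load_inference(), name)
        raise AttributeError(f"module 'lazy_demo' has no attribute {name!r}")

    # A function stored under `__getattr__` in the module namespace is the
    # PEP 562 hook (Python 3.7+).
    mod.__getattr__ = module_getattr
    return mod


lazy_demo = make_lazy_module()
```

Nothing is loaded until an attribute such as `lazy_demo.CoreMLVAE` is touched, which is why adding the two new names to the tuple is enough to export them without slowing down a bare `import`.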

{coreml_diffusion-0.1.1 → coreml_diffusion-0.1.3}/coreml_diffusion/cli.py RENAMED

@@ -37,6 +37,7 @@ def _convert_cmd(args):
         ckpt,
         coreml_diffusion.ModelVersion[args.model_version],
         args.out,
+        component=args.component,
         batch_size=args.batch_size,
         sample_size=sample_size,
         controlnet_support=args.controlnet,
@@ -99,6 +100,13 @@ def build_parser():
         choices=coreml_diffusion.list_model_versions(include_experimental=True),
         help="Model architecture (verified: SD15, SDXL; experimental otherwise)",
     )
+    conv.add_argument(
+        "--component",
+        choices=coreml_diffusion.list_convertible_components(),
+        default="unet",
+        help="Checkpoint component to convert (default unet). VAE/text-encoder "
+        "components ignore --attn-impl/--controlnet/--lora. text_encoder_2 is SDXL-only",
+    )
     conv.add_argument("--out", required=True, help="Output .mlpackage path to write")
     conv.add_argument(
         "--height", type=int, default=512, help="Target image height (default 512)"
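
The new `--component` flag is a plain `argparse` choice argument that defaults to `unet`. The sketch below reproduces only that flag in a standalone parser. The `demo-convert` program name is hypothetical, and the choice list is hard-coded from the `Component` values in this diff rather than read from `list_convertible_components()`:

```python
import argparse

# Hard-coded copy of the Component wire strings from component.py in this diff.
COMPONENT_CHOICES = [
    "unet",
    "vae_decoder",
    "vae_encoder",
    "text_encoder",
    "text_encoder_2",
]


def build_demo_parser():
    """Standalone parser carrying only the --component flag (illustrative)."""
    parser = argparse.ArgumentParser(prog="demo-convert")
    parser.add_argument(
        "--component",
        choices=COMPONENT_CHOICES,
        default="unet",
        help="Checkpoint component to convert (default unet)",
    )
    return parser


# Omitting the flag keeps the historical UNet behaviour.
default_args = build_demo_parser().parse_args([])
vae_args = build_demo_parser().parse_args(["--component", "vae_decoder"])
```

Because `choices` is enforced by `argparse`, an unknown component string is rejected at parse time, before any conversion work starts.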

coreml_diffusion-0.1.3/coreml_diffusion/component.py ADDED

@@ -0,0 +1,32 @@
+"""Convertible model components — framework-free leaf.
+A single-file checkpoint bundles several sub-models; ``coreml_diffusion`` can
+convert each into its own ``.mlpackage``. This enum is the canonical identifier
+set, mirrored into the discovery contract (``list_convertible_components``) and
+the naming contract (``compose_component_name``).
+``UNET`` keeps its historical conversion path and filename (``compose_out_name``);
+the other components are the additive VAE / text-encoder extension. ``.value`` is
+the wire string used by the CLI ``--component`` flag and the discovery list — kept
+lowercase and stable, since a saved workflow / benchmark manifest references it
+verbatim (same additive-only rule as ``ModelVersion``).
+``TEXT_ENCODER_2`` is SDXL-only (its second CLIP, ``CLIPTextModelWithProjection``);
+on SD1.5 only ``TEXT_ENCODER`` exists. Validity per model version is enforced at
+convert time, not encoded here.
+"""
+from enum import Enum
+class Component(Enum):
+    UNET = "unet"
+    VAE_DECODER = "vae_decoder"
+    VAE_ENCODER = "vae_encoder"
+    TEXT_ENCODER = "text_encoder"
+    TEXT_ENCODER_2 = "text_encoder_2"
+# Declaration order is the discovery-list order; UNET leads to match the
+# historical, primary conversion path.
+CONVERTIBLE_COMPONENTS = tuple(c.value for c in Component)
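
The docstring says validity per model version is enforced at convert time rather than in the enum. That check is not part of this excerpt, so the sketch below is an assumption about its shape. `check_component` is a hypothetical helper, and the model version is passed as a plain string (`"SD15"` or `"SDXL"`) instead of the package's `ModelVersion` enum:

```python
from enum import Enum


class Component(Enum):
    UNET = "unet"
    VAE_DECODER = "vae_decoder"
    VAE_ENCODER = "vae_encoder"
    TEXT_ENCODER = "text_encoder"
    TEXT_ENCODER_2 = "text_encoder_2"


# Declaration order becomes the discovery-list order.
CONVERTIBLE_COMPONENTS = tuple(c.value for c in Component)


def check_component(wire_value, model_version):
    """Hypothetical convert-time check: map a wire string to a Component.

    Raises ValueError for an unknown string, or for text_encoder_2 on a
    model version other than SDXL.
    """
    component = Component(wire_value)  # Enum lookup by value
    if component is Component.TEXT_ENCODER_2 and model_version != "SDXL":
        raise ValueError("text_encoder_2 is only available for SDXL checkpoints")
    return component
```

Keeping the enum free of per-model rules means the discovery list stays a flat, stable set of strings, while the stricter check runs where the checkpoint type is actually known.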

{coreml_diffusion-0.1.1 → coreml_diffusion-0.1.3}/coreml_diffusion/conversion/attention.py RENAMED

@@ -9,6 +9,7 @@ CHUNK_SIZE = 512
 def apply_attention_implementation(unet, attention_implementation):
     if attention_implementation == "ORIGINAL":
+        unet.set_attn_processor(OriginalAttnProcessor())
         return unet
     if attention_implementation == "SPLIT_EINSUM":
@@ -24,6 +25,43 @@ def apply_attention_implementation(unet, attention_implementation):
     )
+class OriginalAttnProcessor:
+    """Full (non-split) multi-head attention with an fp32 score path.
+    The ORIGINAL implementation targets the Core ML GPU path (SPLIT_EINSUM* are
+    the ANE-friendly default). It is *not* diffusers' stock attention: that path
+    routes through ``F.scaled_dot_product_attention`` plus ``view(B, -1, heads,
+    d)`` reshapes that fail to convert under coremltools 9 (the same einsum graph
+    SPLIT_EINSUM uses converts cleanly). Nor is it diffusers' legacy
+    ``AttnProcessor`` — its ``get_attention_scores`` builds the score buffer with
+    ``torch.empty(query.shape[0], ...)``, whose dynamic int shape also fails ct9.
+    So this reuses the SPLIT_EINSUM conversion-safe boilerplate and supplies a
+    plain full-attention kernel that upcasts QK^T + softmax to fp32. Without the
+    upcast, fp16 self-attention at 64x64 latents (4096 query tokens) overflows ->
+    inf -> NaN after softmax.
+    """
+    def __call__(
+        self,
+        attn,
+        hidden_states,
+        encoder_hidden_states=None,
+        attention_mask=None,
+        temb=None,
+        *args,
+        **kwargs,
+    ):
+        return _attention_forward(
+            attn,
+            hidden_states,
+            encoder_hidden_states,
+            attention_mask,
+            temb,
+            original,
+        )
 class SplitEinsumAttnProcessor:
     def __call__(
         self,
@@ -82,9 +120,13 @@ def _attention_forward(
     input_ndim = hidden_states.ndim
     if input_ndim == 4:
         batch_size, channel, height, width = hidden_states.shape
-        hidden_states = hidden_states.view(
-            batch_size, channel, height * width
-        ).transpose(1, 2)
+        # flatten(2) instead of view(B, C, height * width): the explicit
+        # height * width multiplies two traced size ints, emitting an aten::mul ->
+        # aten::Int that coremltools 9 cannot fold to a const (it fails the
+        # conversion). flatten collapses the spatial dims with a single reshape and
+        # no symbolic product. Only the 4D path (VAE self-attention) hits this; the
+        # UNet routes attention through a 3D tensor, so ORIGINAL there is untouched.
+        hidden_states = hidden_states.flatten(2).transpose(1, 2)
     else:
         batch_size, _, channel = hidden_states.shape
         height = None
@@ -158,6 +200,29 @@ def _attention_forward(
     return hidden_states
+def original(q, k, v, mask, heads, dim_head):
+    """Full multi-head attention with the QK^T scaling + softmax in fp32.
+    Same ``[B, C, 1, S]`` channel-major layout and mask convention as
+    ``split_einsum`` (so it slots into ``_attention_forward`` unchanged), but
+    computes the whole score matrix per head in one batched einsum instead of the
+    per-head split. Upcasting the scores to fp32 keeps the softmax stable when the
+    converted model runs in fp16 (QK^T at 4096 tokens overflows fp16 otherwise).
+    """
+    batch = q.size(0)
+    mh_q = q.view(batch, heads, dim_head, -1).float()
+    mh_k = k.view(batch, heads, dim_head, -1).float()
+    mh_v = v.view(batch, heads, dim_head, -1)
+    weights = torch.einsum("becq,beck->bkeq", mh_q, mh_k) * (dim_head**-0.5)
+    if mask is not None:
+        weights = weights + mask
+    weights = weights.softmax(dim=1).to(mh_v.dtype)
+    outputs = torch.einsum("bkeq,beck->becq", weights, mh_v)
+    return outputs.reshape(batch, heads * dim_head, 1, -1)
 def split_einsum(q, k, v, mask, heads, dim_head):
     q_heads = _split_heads(q, heads, dim_head)
     k = k.transpose(1, 3)
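
The `OriginalAttnProcessor` docstring describes an fp16 overflow that turns into NaN after softmax. The mechanism can be reproduced without torch, as sketched below. The vector length of 40 and the activation magnitudes are illustrative assumptions, not measurements from the model, and Python's float64 stands in for the fp32 score path:

```python
import math
import struct


def to_fp16(x):
    """Round a float to IEEE half precision; overflow becomes +/-inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)


def dot_fp16(q, k):
    """Dot product with every intermediate result rounded to fp16."""
    acc = 0.0
    for a, b in zip(q, k):
        acc = to_fp16(acc + to_fp16(a * b))
    return acc


def dot_wide(q, k):
    """Same dot product in float64 (stand-in for the fp32 upcast)."""
    return sum(a * b for a, b in zip(q, k))


def softmax(scores):
    """Max-subtracted softmax; an inf score yields inf - inf = nan."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


DIM_HEAD = 40                 # assumed head size for illustration
SCALE = DIM_HEAD ** -0.5
query = [60.0] * DIM_HEAD     # assumed large activations
keys = [[60.0] * DIM_HEAD, [0.5] * DIM_HEAD]

# 40 * 60 * 60 = 144000 exceeds the fp16 maximum of 65504.
scores_fp16 = [dot_fp16(query, k) * SCALE for k in keys]
scores_wide = [dot_wide(query, k) * SCALE for k in keys]

weights_fp16 = softmax(scores_fp16)
weights_wide = softmax(scores_wide)
```

In the half-precision path the first score saturates to `inf` and the softmax weights come out as NaN, while the wider path stays finite. That is the failure the `.float()` upcast in `original` is there to avoid.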

coreml_diffusion-0.1.3/coreml_diffusion/conversion/text_encoder.py ADDED

@@ -0,0 +1,85 @@
+"""CLIP text-encoder wrapper adapting it to a flat Core ML tensor contract.
+Wraps a transformers ``CLIPTextModel`` (SD1.5, SDXL encoder 1) or
+``CLIPTextModelWithProjection`` (SDXL encoder 2) so the traced graph takes a
+single ``input_ids`` tensor ``(B, 77)`` and returns the embeddings the diffusion
+pipeline consumes — nothing else.
+Which hidden state and whether a pooled vector is emitted depends on the model:
+  SD1.5            : final ``last_hidden_state`` (768), no pooled
+  SDXL encoder 1   : penultimate ``hidden_states[-2]`` (768), no pooled
+  SDXL encoder 2   : penultimate ``hidden_states[-2]`` (1280) + projected pooled (1280)
+The penultimate selection is SDXL's documented behaviour (it concatenates the
+two encoders' penultimate states and uses encoder 2's projected pooled output as
+the ``add_embeds`` conditioning). ``hidden_states_index=None`` selects the final
+``last_hidden_state``; an int indexes ``hidden_states`` directly.
+The pooled vector prefers ``text_embeds`` (the projection head, encoder 2) and
+falls back to ``pooler_output`` — so the same wrapper serves both CLIP variants.
+``input_ids`` is fed as int32 at the Core ML boundary (see ``convert``).
+"""
+import contextlib
+import torch
+@contextlib.contextmanager
+def static_causal_mask(seq_len):
+    """Patch CLIP's causal-mask builder to a constant for the trace duration.
+    transformers builds the causal mask from ``query_length + past_key_values_length``,
+    both traced 0-dim size tensors; the resulting ``aten::Int`` cannot be folded to a
+    const under coremltools 9 (conversion fails). Since the converted sequence length
+    is fixed at trace time, swap ``_create_4d_causal_attention_mask`` for a closure
+    that materialises the upper-triangular ``-inf`` mask from a Python-int ``seq_len``
+    — a pure constant the frontend folds away. Mirrors ``prepare_unet_for_coreml_trace``:
+    a trace-only enablement patch, restored on exit. Shape ``(1, 1, seq, seq)``
+    broadcasts over batch/heads, so no symbolic batch leaks back in.
+    """
+    from transformers.models.clip import modeling_clip
+    original = modeling_clip._create_4d_causal_attention_mask
+    def _const_mask(input_shape, dtype, device=None, *args, **kwargs):
+        mask = torch.full(
+            (seq_len, seq_len), torch.finfo(dtype).min, dtype=dtype, device=device
+        )
+        return torch.triu(mask, diagonal=1)[None, None]
+    modeling_clip._create_4d_causal_attention_mask = _const_mask
+    try:
+        yield
+    finally:
+        modeling_clip._create_4d_causal_attention_mask = original
+class CoreMLTextEncoderWrapper(torch.nn.Module):
+    """token ids ``(B, 77)`` -> embeddings (+ optional pooled)."""
+    def __init__(self, text_encoder, *, hidden_states_index=None, output_pooled=False):
+        super().__init__()
+        self.text_encoder = text_encoder
+        self.hidden_states_index = hidden_states_index
+        self.output_pooled = output_pooled
+    def forward(self, input_ids):
+        out = self.text_encoder(
+            input_ids,
+            output_hidden_states=self.hidden_states_index is not None,
+            return_dict=True,
+        )
+        if self.hidden_states_index is None:
+            embeds = out.last_hidden_state
+        else:
+            embeds = out.hidden_states[self.hidden_states_index]
+        if not self.output_pooled:
+            return embeds
+        pooled = getattr(out, "text_embeds", None)
+        if pooled is None:
+            pooled = out.pooler_output
+        return embeds, pooled
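
`static_causal_mask` is a patch-and-restore context manager: it swaps a library function for a constant-producing closure during tracing and puts the original back on exit. The sketch below shows the same pattern with the standard library only. `fake_modeling` and its `build_mask` attribute are hypothetical stand-ins for `modeling_clip._create_4d_causal_attention_mask`, and the mask is a nested list rather than a tensor:

```python
import contextlib
import types

# float32 minimum, standing in for torch.finfo(dtype).min.
MASK_VALUE = -3.4028234663852886e38

# Hypothetical stand-in for the transformers CLIP modeling module.
fake_modeling = types.SimpleNamespace(
    build_mask=lambda input_shape: "mask built from traced sizes"
)


def const_causal_mask(seq_len):
    """Causal mask from a Python int: position i may not attend to j > i."""
    return [
        [MASK_VALUE if j > i else 0.0 for j in range(seq_len)]
        for i in range(seq_len)
    ]


@contextlib.contextmanager
def static_mask_patch(module, seq_len):
    """Temporarily replace module.build_mask with a constant-mask closure."""
    original = module.build_mask
    # The closure ignores its arguments; seq_len is fixed up front.
    module.build_mask = lambda *args, **kwargs: const_causal_mask(seq_len)
    try:
        yield
    finally:
        # Restore even if the body raises, so the patch never leaks.
        module.build_mask = original


with static_mask_patch(fake_modeling, seq_len=3):
    patched_mask = fake_modeling.build_mask((1, 77))

restored_result = fake_modeling.build_mask((1, 77))
```

The `try`/`finally` is the important part: a failed trace must not leave the library permanently patched for later callers in the same process.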

coreml_diffusion-0.1.3/coreml_diffusion/conversion/vae.py ADDED

@@ -0,0 +1,49 @@
+"""VAE wrappers adapting ``AutoencoderKL`` to a flat Core ML tensor contract.
+Two thin modules, one per direction, mirroring ``CoreMLUNetWrapper``: they expose
+a single positional tensor in / single tensor out so the traced graph has a clean,
+named Core ML I/O signature. They call the VAE *submodules* directly
+(``post_quant_conv``/``decoder``, ``encoder``/``quant_conv``) rather than
+``decode``/``encode`` — the same op graph, but free of the ``return_dict``
+plumbing and the ``DiagonalGaussianDistribution`` wrapper that complicate tracing.
+Scaling is intentionally NOT baked in. The pipeline owns ``scaling_factor``
+(``latent = latent / scaling_factor`` before decode, ``moments`` -> distribution
+-> sample -> ``* scaling_factor`` after encode), keeping these artifacts 1:1 with
+the reference VAE.
+The mid-block self-attention is converted via the ORIGINAL (full, fp32-score)
+processor; see ``convert.convert_vae_*``.
+"""
+import torch
+class CoreMLVAEDecoderWrapper(torch.nn.Module):
+    """latent ``(B, latent_channels, h, w)`` -> image ``(B, 3, h*8, w*8)``."""
+    def __init__(self, vae):
+        super().__init__()
+        self.vae = vae
+    def forward(self, latent):
+        z = self.vae.post_quant_conv(latent)
+        return self.vae.decoder(z)
+class CoreMLVAEEncoderWrapper(torch.nn.Module):
+    """image ``(B, 3, h*8, w*8)`` -> latent moments ``(B, 2*latent_channels, h, w)``.
+    Outputs the raw moments (mean ‖ logvar) exactly as ``AutoencoderKL.encode``
+    produces them before wrapping in a ``DiagonalGaussianDistribution``. Sampling
+    (mean + std·noise) is deferred to the pipeline so the converted encoder stays
+    deterministic and noise-source agnostic.
+    """
+    def __init__(self, vae):
+        super().__init__()
+        self.vae = vae
+    def forward(self, image):
+        h = self.vae.encoder(image)
+        return self.vae.quant_conv(h)
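
Since scaling and sampling are left to the pipeline, the caller has to split the encoder's moments, sample, and apply `scaling_factor` itself. A rough sketch of that arithmetic on flat Python lists follows. The helper names are hypothetical, and `0.18215` is assumed as the scaling factor (the value commonly shipped in SD1.5 VAE configs); a real pipeline would read it from the VAE config:

```python
import math

# Assumed SD1.5 value for illustration; read it from the VAE config in practice.
SCALING_FACTOR = 0.18215


def sample_latent(moments, noise, scaling_factor=SCALING_FACTOR):
    """Turn encoder moments into a scaled latent sample.

    `moments` holds the mean values followed by the logvar values, mirroring
    the (mean, logvar) channel concatenation described in the docstring.
    """
    half = len(moments) // 2
    mean, logvar = moments[:half], moments[half:]
    return [
        (m + math.exp(0.5 * lv) * n) * scaling_factor
        for m, lv, n in zip(mean, logvar, noise)
    ]


def unscale_for_decode(latent, scaling_factor=SCALING_FACTOR):
    """Undo the scaling before handing the latent to the decoder."""
    return [x / scaling_factor for x in latent]


# Two latent values: means 1.0 and -2.0, logvar 0.0 (std 1.0) for both.
moments = [1.0, -2.0, 0.0, 0.0]
deterministic = sample_latent(moments, noise=[0.0, 0.0])
noisy = sample_latent(moments, noise=[1.0, -1.0])
round_trip = unscale_for_decode(deterministic)
```

With zero noise the sample is just the scaled mean, which is what keeps the converted encoder deterministic: all randomness enters through the `noise` argument the pipeline supplies.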
