liger-kernel 0.3.0__tar.gz → 0.3.1__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (47)
  1. {liger_kernel-0.3.0/src/liger_kernel.egg-info → liger_kernel-0.3.1}/PKG-INFO +15 -8
  2. {liger_kernel-0.3.0 → liger_kernel-0.3.1}/README.md +11 -5
  3. {liger_kernel-0.3.0 → liger_kernel-0.3.1}/pyproject.toml +6 -3
  4. {liger_kernel-0.3.0 → liger_kernel-0.3.1}/src/liger_kernel/ops/fused_linear_cross_entropy.py +1 -1
  5. {liger_kernel-0.3.0 → liger_kernel-0.3.1}/src/liger_kernel/ops/geglu.py +2 -2
  6. {liger_kernel-0.3.0 → liger_kernel-0.3.1}/src/liger_kernel/ops/kl_div.py +43 -32
  7. {liger_kernel-0.3.0 → liger_kernel-0.3.1}/src/liger_kernel/ops/swiglu.py +2 -2
  8. {liger_kernel-0.3.0 → liger_kernel-0.3.1}/src/liger_kernel/transformers/auto_model.py +18 -6
  9. {liger_kernel-0.3.0 → liger_kernel-0.3.1}/src/liger_kernel/transformers/kl_div.py +3 -2
  10. {liger_kernel-0.3.0 → liger_kernel-0.3.1}/src/liger_kernel/transformers/monkey_patch.py +96 -122
  11. {liger_kernel-0.3.0 → liger_kernel-0.3.1/src/liger_kernel.egg-info}/PKG-INFO +15 -8
  12. {liger_kernel-0.3.0 → liger_kernel-0.3.1}/src/liger_kernel.egg-info/requires.txt +4 -2
  13. {liger_kernel-0.3.0 → liger_kernel-0.3.1}/LICENSE +0 -0
  14. {liger_kernel-0.3.0 → liger_kernel-0.3.1}/NOTICE +0 -0
  15. {liger_kernel-0.3.0 → liger_kernel-0.3.1}/setup.cfg +0 -0
  16. {liger_kernel-0.3.0 → liger_kernel-0.3.1}/src/liger_kernel/env_report.py +0 -0
  17. {liger_kernel-0.3.0 → liger_kernel-0.3.1}/src/liger_kernel/ops/__init__.py +0 -0
  18. {liger_kernel-0.3.0 → liger_kernel-0.3.1}/src/liger_kernel/ops/cross_entropy.py +0 -0
  19. {liger_kernel-0.3.0 → liger_kernel-0.3.1}/src/liger_kernel/ops/experimental/embedding.py +0 -0
  20. {liger_kernel-0.3.0 → liger_kernel-0.3.1}/src/liger_kernel/ops/layer_norm.py +0 -0
  21. {liger_kernel-0.3.0 → liger_kernel-0.3.1}/src/liger_kernel/ops/rms_norm.py +0 -0
  22. {liger_kernel-0.3.0 → liger_kernel-0.3.1}/src/liger_kernel/ops/rope.py +0 -0
  23. {liger_kernel-0.3.0 → liger_kernel-0.3.1}/src/liger_kernel/ops/utils.py +0 -0
  24. {liger_kernel-0.3.0 → liger_kernel-0.3.1}/src/liger_kernel/transformers/__init__.py +0 -0
  25. {liger_kernel-0.3.0 → liger_kernel-0.3.1}/src/liger_kernel/transformers/cross_entropy.py +0 -0
  26. {liger_kernel-0.3.0 → liger_kernel-0.3.1}/src/liger_kernel/transformers/experimental/embedding.py +0 -0
  27. {liger_kernel-0.3.0 → liger_kernel-0.3.1}/src/liger_kernel/transformers/functional.py +0 -0
  28. {liger_kernel-0.3.0 → liger_kernel-0.3.1}/src/liger_kernel/transformers/fused_linear_cross_entropy.py +0 -0
  29. {liger_kernel-0.3.0 → liger_kernel-0.3.1}/src/liger_kernel/transformers/geglu.py +0 -0
  30. {liger_kernel-0.3.0 → liger_kernel-0.3.1}/src/liger_kernel/transformers/layer_norm.py +0 -0
  31. {liger_kernel-0.3.0 → liger_kernel-0.3.1}/src/liger_kernel/transformers/model/__init__.py +0 -0
  32. {liger_kernel-0.3.0 → liger_kernel-0.3.1}/src/liger_kernel/transformers/model/gemma.py +0 -0
  33. {liger_kernel-0.3.0 → liger_kernel-0.3.1}/src/liger_kernel/transformers/model/llama.py +0 -0
  34. {liger_kernel-0.3.0 → liger_kernel-0.3.1}/src/liger_kernel/transformers/model/mistral.py +0 -0
  35. {liger_kernel-0.3.0 → liger_kernel-0.3.1}/src/liger_kernel/transformers/model/mixtral.py +0 -0
  36. {liger_kernel-0.3.0 → liger_kernel-0.3.1}/src/liger_kernel/transformers/model/phi3.py +0 -0
  37. {liger_kernel-0.3.0 → liger_kernel-0.3.1}/src/liger_kernel/transformers/model/qwen2.py +0 -0
  38. {liger_kernel-0.3.0 → liger_kernel-0.3.1}/src/liger_kernel/transformers/model/qwen2_vl.py +0 -0
  39. {liger_kernel-0.3.0 → liger_kernel-0.3.1}/src/liger_kernel/transformers/rms_norm.py +0 -0
  40. {liger_kernel-0.3.0 → liger_kernel-0.3.1}/src/liger_kernel/transformers/rope.py +0 -0
  41. {liger_kernel-0.3.0 → liger_kernel-0.3.1}/src/liger_kernel/transformers/swiglu.py +0 -0
  42. {liger_kernel-0.3.0 → liger_kernel-0.3.1}/src/liger_kernel/transformers/trainer_integration.py +0 -0
  43. {liger_kernel-0.3.0 → liger_kernel-0.3.1}/src/liger_kernel/triton/__init__.py +0 -0
  44. {liger_kernel-0.3.0 → liger_kernel-0.3.1}/src/liger_kernel/triton/monkey_patch.py +0 -0
  45. {liger_kernel-0.3.0 → liger_kernel-0.3.1}/src/liger_kernel.egg-info/SOURCES.txt +0 -0
  46. {liger_kernel-0.3.0 → liger_kernel-0.3.1}/src/liger_kernel.egg-info/dependency_links.txt +0 -0
  47. {liger_kernel-0.3.0 → liger_kernel-0.3.1}/src/liger_kernel.egg-info/top_level.txt +0 -0
{liger_kernel-0.3.0/src/liger_kernel.egg-info → liger_kernel-0.3.1}/PKG-INFO

@@ -1,6 +1,6 @@
  Metadata-Version: 2.1
  Name: liger_kernel
- Version: 0.3.0
+ Version: 0.3.1
  Summary: Efficient Triton kernels for LLM Training
  License: BSD 2-CLAUSE LICENSE
  Copyright 2024 LinkedIn Corporation
@@ -32,15 +32,16 @@ License-File: LICENSE
  License-File: NOTICE
  Requires-Dist: torch>=2.1.2
  Requires-Dist: triton>=2.3.0
- Requires-Dist: transformers>=4.42.0
+ Provides-Extra: transformers
+ Requires-Dist: transformers~=4.0; extra == "transformers"
  Provides-Extra: dev
+ Requires-Dist: transformers>=4.44.2; extra == "dev"
  Requires-Dist: matplotlib>=3.7.2; extra == "dev"
  Requires-Dist: flake8>=4.0.1.1; extra == "dev"
  Requires-Dist: black>=24.4.2; extra == "dev"
  Requires-Dist: isort>=5.13.2; extra == "dev"
  Requires-Dist: pytest>=7.1.2; extra == "dev"
  Requires-Dist: datasets>=2.19.2; extra == "dev"
- Requires-Dist: jupyter==1.0.0; extra == "dev"
  Requires-Dist: seaborn; extra == "dev"

  # Liger Kernel: Efficient Triton Kernels for LLM Training
@@ -74,8 +75,8 @@ Requires-Dist: seaborn; extra == "dev"
  </a>
  </td>
  <td style="padding: 10px;">
- <a href="https://discord.gg/CX2YmNmn">
- <img src="https://dcbadge.vercel.app/api/server/cudamode?style=flat" alt="Join Our Discord">
+ <a href="https://discord.gg/gpumode">
+ <img src="https://dcbadge.vercel.app/api/server/gpumode?style=flat" alt="Join Our Discord">
  </a>
  </td>
  </tr>
@@ -151,7 +152,10 @@ With one line of code, Liger Kernel can increase throughput by more than 20% and

  - `torch >= 2.1.2`
  - `triton >= 2.3.0`
- - `transformers >= 4.42.0`
+
+ ### Optional Dependencies
+
+ - `transformers >= 4.x`: Required if you plan to use the transformers models patching APIs. The specific model you are working will dictate the minimum version of transformers.

  > **Note:**
  > Our kernels inherit the full spectrum of hardware compatibility offered by [Triton](https://github.com/triton-lang/triton).
@@ -174,7 +178,10 @@ To install from source:
  git clone https://github.com/linkedin/Liger-Kernel.git
  cd Liger-Kernel
  pip install -e .
+ # or if using transformers
+ pip install -e .[transformers]
  ```
+
  ## Getting Started

  There are a couple of ways to apply Liger kernels, depending on the level of customization required.
@@ -271,9 +278,9 @@ loss.backward()
  | Mixtral | `liger_kernel.transformers.apply_liger_kernel_to_mixtral` | RoPE, RMSNorm, SwiGLU, CrossEntropyLoss, FusedLinearCrossEntropy |
  | Gemma1 | `liger_kernel.transformers.apply_liger_kernel_to_gemma` | RoPE, RMSNorm, GeGLU, CrossEntropyLoss, FusedLinearCrossEntropy |
  | Gemma2 | `liger_kernel.transformers.apply_liger_kernel_to_gemma2` | RoPE, RMSNorm, GeGLU, CrossEntropyLoss |
- | Qwen2 | `liger_kernel.transformers.apply_liger_kernel_to_qwen2` | RoPE, RMSNorm, SwiGLU, CrossEntropyLoss, FusedLinearCrossEntropy |
+ | Qwen2 & Qwen2.5 | `liger_kernel.transformers.apply_liger_kernel_to_qwen2` | RoPE, RMSNorm, SwiGLU, CrossEntropyLoss, FusedLinearCrossEntropy |
  | Qwen2-VL | `liger_kernel.transformers.apply_liger_kernel_to_qwen2_vl` | RMSNorm, LayerNorm, SwiGLU, CrossEntropyLoss, FusedLinearCrossEntropy |
- | Phi3 | `liger_kernel.transformers.apply_liger_kernel_to_phi3` | RoPE, RMSNorm, SwiGLU, CrossEntropyLoss, FusedLinearCrossEntropy |
+ | Phi3 & Phi3.5 | `liger_kernel.transformers.apply_liger_kernel_to_phi3` | RoPE, RMSNorm, SwiGLU, CrossEntropyLoss, FusedLinearCrossEntropy |

{liger_kernel-0.3.0 → liger_kernel-0.3.1}/README.md

@@ -29,8 +29,8 @@
  </a>
  </td>
  <td style="padding: 10px;">
- <a href="https://discord.gg/CX2YmNmn">
- <img src="https://dcbadge.vercel.app/api/server/cudamode?style=flat" alt="Join Our Discord">
+ <a href="https://discord.gg/gpumode">
+ <img src="https://dcbadge.vercel.app/api/server/gpumode?style=flat" alt="Join Our Discord">
  </a>
  </td>
  </tr>
@@ -106,7 +106,10 @@ With one line of code, Liger Kernel can increase throughput by more than 20% and

  - `torch >= 2.1.2`
  - `triton >= 2.3.0`
- - `transformers >= 4.42.0`
+
+ ### Optional Dependencies
+
+ - `transformers >= 4.x`: Required if you plan to use the transformers models patching APIs. The specific model you are working will dictate the minimum version of transformers.

  > **Note:**
  > Our kernels inherit the full spectrum of hardware compatibility offered by [Triton](https://github.com/triton-lang/triton).
@@ -129,7 +132,10 @@ To install from source:
  git clone https://github.com/linkedin/Liger-Kernel.git
  cd Liger-Kernel
  pip install -e .
+ # or if using transformers
+ pip install -e .[transformers]
  ```
+
  ## Getting Started

  There are a couple of ways to apply Liger kernels, depending on the level of customization required.
@@ -226,9 +232,9 @@ loss.backward()
  | Mixtral | `liger_kernel.transformers.apply_liger_kernel_to_mixtral` | RoPE, RMSNorm, SwiGLU, CrossEntropyLoss, FusedLinearCrossEntropy |
  | Gemma1 | `liger_kernel.transformers.apply_liger_kernel_to_gemma` | RoPE, RMSNorm, GeGLU, CrossEntropyLoss, FusedLinearCrossEntropy |
  | Gemma2 | `liger_kernel.transformers.apply_liger_kernel_to_gemma2` | RoPE, RMSNorm, GeGLU, CrossEntropyLoss |
- | Qwen2 | `liger_kernel.transformers.apply_liger_kernel_to_qwen2` | RoPE, RMSNorm, SwiGLU, CrossEntropyLoss, FusedLinearCrossEntropy |
+ | Qwen2 & Qwen2.5 | `liger_kernel.transformers.apply_liger_kernel_to_qwen2` | RoPE, RMSNorm, SwiGLU, CrossEntropyLoss, FusedLinearCrossEntropy |
  | Qwen2-VL | `liger_kernel.transformers.apply_liger_kernel_to_qwen2_vl` | RMSNorm, LayerNorm, SwiGLU, CrossEntropyLoss, FusedLinearCrossEntropy |
- | Phi3 | `liger_kernel.transformers.apply_liger_kernel_to_phi3` | RoPE, RMSNorm, SwiGLU, CrossEntropyLoss, FusedLinearCrossEntropy |
+ | Phi3 & Phi3.5 | `liger_kernel.transformers.apply_liger_kernel_to_phi3` | RoPE, RMSNorm, SwiGLU, CrossEntropyLoss, FusedLinearCrossEntropy |

{liger_kernel-0.3.0 → liger_kernel-0.3.1}/pyproject.toml

@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"

  [project]
  name = "liger_kernel"
- version = "0.3.0"
+ version = "0.3.1"
  description = "Efficient Triton kernels for LLM Training"
  urls = { "Homepage" = "https://github.com/linkedin/Liger-Kernel" }
  readme = { file = "README.md", content-type = "text/markdown" }
@@ -12,18 +12,21 @@ license = { file = "LICENSE" }
  dependencies = [
  "torch>=2.1.2",
  "triton>=2.3.0",
- "transformers>=4.42.0"
  ]

  [project.optional-dependencies]
+ transformers = [
+ "transformers~=4.0"
+ ]
+
  dev = [
+ "transformers>=4.44.2",
  "matplotlib>=3.7.2",
  "flake8>=4.0.1.1",
  "black>=24.4.2",
  "isort>=5.13.2",
  "pytest>=7.1.2",
  "datasets>=2.19.2",
- "jupyter==1.0.0",
  "seaborn",
  ]

{liger_kernel-0.3.0 → liger_kernel-0.3.1}/src/liger_kernel/ops/fused_linear_cross_entropy.py

@@ -97,7 +97,7 @@ def fused_linear_cross_entropy_forward(

  # gradient of logits_chunk is computed in-place by the above triton kernel.
  # Following HuggingFace model source code, we do the forward and backward
- # w.r.t. logits in fp32 for numerical stability especially as the num classes (vocab size) os huge.
+ # w.r.t. logits in fp32 for numerical stability especially as the num classes (vocab size) is huge.
  # (reference: https://github.com/huggingface/transformers/blob/v4.42.4/src/transformers/models/llama/modeling_llama.py#L1194)
  # Propagating to lm_head's backward, we'll switch back to the original dtype.
  logits_chunk = logits_chunk.to(dtype)
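
The corrected comment describes the same pattern HuggingFace's Llama modeling code uses: do the loss math on the logits in fp32, then switch back to the original dtype for the rest of the backward. A rough, self-contained PyTorch sketch of that idea (illustrative only, not the fused Triton path; the sizes are made up):

```python
import torch
import torch.nn.functional as F

hidden = torch.randn(8, 4096, dtype=torch.bfloat16, requires_grad=True)
lm_head = torch.nn.Linear(4096, 128256, bias=False, dtype=torch.bfloat16)
labels = torch.randint(0, 128256, (8,))

logits = lm_head(hidden)                         # bf16 logits over a huge vocab
loss = F.cross_entropy(logits.float(), labels)   # loss computed in fp32 for stability
loss.backward()                                  # gradients flow back to the bf16 weights
```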
{liger_kernel-0.3.0 → liger_kernel-0.3.1}/src/liger_kernel/ops/geglu.py

@@ -25,7 +25,7 @@ else:
  def _geglu_tanh_forward_kernel(
  a, b, c, stride, n_cols: tl.constexpr, BLOCK_SIZE: tl.constexpr
  ):
- program_id = tl.program_id(0)
+ program_id = tl.program_id(0).cast(tl.int64)

  # locate start index
  a += program_id * stride
@@ -52,7 +52,7 @@ def _geglu_tanh_forward_kernel(
  def _geglu_tanh_backward_kernel(
  dc, a, b, stride, n_cols: tl.constexpr, BLOCK_SIZE: tl.constexpr
  ):
- program_id = tl.program_id(0)
+ program_id = tl.program_id(0).cast(tl.int64)

  # locate start index
  dc += program_id * stride
{liger_kernel-0.3.0 → liger_kernel-0.3.1}/src/liger_kernel/ops/kl_div.py

@@ -45,6 +45,7 @@ def _kldiv_kernel_forward(
  loss_ptr, # [B] or [B, S] if reduction == _REDUCTION_MODE_NONE, output ptr
  loss_stride, # int, output stride
  n_cols, # int, number of columns in the input tensor
+ eps,
  BLOCK_SIZE: tl.constexpr,
  log_target: tl.constexpr = False,
  reduction: tl.constexpr = _REDUCTION_MODE_BATCHMEAN,
@@ -56,6 +57,7 @@ def _kldiv_kernel_forward(

  base_offsets = tl.arange(0, BLOCK_SIZE)

+ loss_sum = 0.0
  for i in range(0, n_cols, BLOCK_SIZE):
  offsets = i + base_offsets
  mask = offsets < n_cols
@@ -65,32 +67,33 @@ def _kldiv_kernel_forward(
  # KL(y_true || y) = y_true * (log(y_true) - log(y))
  # We compute KL(y_true || y) with y in the log-space
  if not log_target:
- loss = y_true * (tl.log(y_true) - y)
+ loss = y_true * (tl.log(tl.maximum(y_true, eps)) - y)
  else:
  loss = tl.exp(y_true) * (y_true - y)

  if reduction == _REDUCTION_MODE_NONE:
  tl.store(loss_ptr + offsets, loss, mask=mask)
  else:
- loss = tl.sum(loss, axis=0)
- tl.store(loss_ptr, loss)
- loss_ptr += 1 # in case of reduction, the output tensor has dimensions [B,], therefore stride is always 1
+ loss_sum += tl.sum(loss, axis=0)
+
+ if reduction != _REDUCTION_MODE_NONE:
+ tl.store(loss_ptr, loss_sum)


  @triton.jit
  def _kldiv_kernel_backward(
- input_ptr,
- input_stride,
  target_ptr,
  target_stride,
+ new_grads_ptr,
+ new_grads_stride,
  n_cols,
  BLOCK_SIZE: tl.constexpr,
  log_target: tl.constexpr = False,
  ):
  pid = tl.program_id(0).to(tl.int64)

- input_ptr += pid * input_stride
  target_ptr += pid * target_stride
+ new_grads_ptr += pid * new_grads_stride

  offsets = tl.arange(0, BLOCK_SIZE)
  mask = offsets < n_cols
@@ -106,19 +109,19 @@ def _kldiv_kernel_backward(
  else:
  res = -tl.exp(target)

- tl.store(input_ptr + offsets, res, mask=mask)
+ tl.store(new_grads_ptr + offsets, res, mask=mask)


- def kldiv_forward_triton(y_pred, y_true, log_target, reduction): # [B, S] # [B, S]
- B, S = y_pred.shape
+ def kldiv_forward_triton(y_pred, y_true, log_target, reduction, eps): # [BT, V]
+ BT, V = y_pred.shape

- BLOCK_SIZE = min(MAX_FUSED_SIZE, triton.next_power_of_2(S))
+ BLOCK_SIZE = min(MAX_FUSED_SIZE, triton.next_power_of_2(V))
  num_warps = get_num_warps(BLOCK_SIZE)

- grid = (B,)
+ grid = (BT,)
  reduction = _str_to_reduction_mode[reduction]

- out_size = (B, S) if reduction == _REDUCTION_MODE_NONE.value else (B,)
+ out_size = (BT, V) if reduction == _REDUCTION_MODE_NONE.value else (BT,)
  output_tensor = torch.zeros(out_size, device=y_pred.device, dtype=torch.float32)

  _kldiv_kernel_forward[grid](
@@ -128,7 +131,8 @@ def kldiv_forward_triton(y_pred, y_true, log_target, reduction): # [B, S] # [B
  y_true.stride(0),
  output_tensor,
  output_tensor.stride(0),
- S,
+ V,
+ eps=eps,
  BLOCK_SIZE=BLOCK_SIZE,
  num_warps=num_warps,
  log_target=log_target,
@@ -139,30 +143,30 @@
  # https://pytorch.org/docs/stable/generated/torch.nn.KLDivLoss.html
  # https://github.com/pytorch/pytorch/blob/d7b57c4d63edb42e1deeeba9497fcb5f1f748ff2/torch/nn/functional.py#L3372
  if reduction == _REDUCTION_MODE_BATCHMEAN.value:
- return output_tensor.sum() / B
+ return output_tensor.sum() / BT
  elif reduction == _REDUCTION_MODE_SUM.value:
  return output_tensor.sum(dim=0)
  elif reduction == _REDUCTION_MODE_MEAN.value:
- return output_tensor.mean(dim=0)
+ return output_tensor.sum() / (BT * V)
  else:
  return output_tensor


- def kldiv_backward_triton(input, target, grad_output, log_target):
- B, S = input.shape
+ def kldiv_backward_triton(target, grad_output, new_grads, log_target):
+ BT, V = target.shape

- BLOCK_SIZE = min(MAX_FUSED_SIZE, triton.next_power_of_2(S))
+ BLOCK_SIZE = min(MAX_FUSED_SIZE, triton.next_power_of_2(V))
  num_warps = get_num_warps(BLOCK_SIZE)

- grid = (B,)
+ grid = (BT,)

  # We store the gradients in-place in the input tensor
  _kldiv_kernel_backward[grid](
- input,
- input.stride(0),
  target,
  target.stride(0),
- S,
+ new_grads,
+ new_grads.stride(0),
+ V,
  BLOCK_SIZE=BLOCK_SIZE,
  num_warps=num_warps,
  log_target=log_target,
@@ -170,9 +174,9 @@ def kldiv_backward_triton(input, target, grad_output, log_target):

  # If cross entropy is the last layer, grad_output is 1.0. Skip the mul then.
  if torch.equal(grad_output, torch.tensor(1.0, device=grad_output.device)):
- return input
+ return new_grads

- return input * grad_output
+ return new_grads * grad_output


  class LigerKLDivLossFunction(torch.autograd.Function):
@@ -196,6 +200,7 @@ class LigerKLDivLossFunction(torch.autograd.Function):
  y_true: torch.Tensor,
  reduction: REDUCTION_LITERAL = "batchmean",
  log_target: bool = False,
+ eps: float = 1e-10,
  ) -> torch.Tensor:
  """A forward pass for the KL Divergence Loss.

@@ -205,15 +210,16 @@
  y_true (torch.Tensor): A tensor of shape (BT, V) containing the target values, expected to be either probabilities or log-probabilities, depending on the value of `log_target`.
  reduction (REDUCTION_LITERAL, optional): Reduction to be used. Defaults to "batchmean".
  log_target (bool, optional): If set to true, expects the ground truth to already be log-probabilities. Defaults to False.
+ eps: (float, optional): A small value to avoid division by zero. Defaults to 1e-10.

  Returns:
  torch.Tensor: The computed KL Divergence Loss, with shape (BT, V) if `reduction` is "none", else a scalar.
  """
- ctx.save_for_backward(y_pred, y_true)
+ ctx.save_for_backward(y_true)
  ctx.reduction = reduction
  ctx.log_target = log_target
  return kldiv_forward_triton(
- y_pred, y_true, log_target=log_target, reduction=reduction
+ y_pred, y_true, log_target=log_target, reduction=reduction, eps=eps
  )

  @staticmethod
@@ -226,22 +232,27 @@ class LigerKLDivLossFunction(torch.autograd.Function):
  grad_output (torch.Tensor): The gradient of the loss with respect to the output.

  Returns:
- tuple[torch.Tensor, None, None, None]: The gradient of the loss with respect to the inputs and None for the other arguments of the forward method.
+ tuple[torch.Tensor, None, None, None, None]: The gradient of the loss with respect to the inputs and None for the other arguments of the forward method.
  """
- y_pred, y_true = ctx.saved_tensors
+ (y_true,) = ctx.saved_tensors
+
+ new_grads = torch.empty_like(y_true)

- derivative = kldiv_backward_triton(y_pred, y_true, grad_output, ctx.log_target)
+ derivative = kldiv_backward_triton(
+ y_true, grad_output, new_grads, ctx.log_target
+ )

  if ctx.reduction == "batchmean":
- derivative = derivative / y_pred.shape[0]
+ derivative = derivative / y_true.shape[0]
  elif ctx.reduction == "sum" or ctx.reduction == "none":
  pass
  elif ctx.reduction == "mean":
- derivative = derivative / (y_pred.shape[0] * y_pred.shape[1])
+ derivative = derivative / (y_true.shape[0] * y_true.shape[1])

  return (
  derivative,
  None,
  None,
  None,
+ None,
  )
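
The forward kernel now clamps the target with `eps` before taking its log, so `0 * log(0)` terms from exact zeros in the target no longer poison the sum, and the per-row total is accumulated in `loss_sum` instead of repeatedly overwriting `loss_ptr`. A plain-PyTorch reference of the non-log-target, batchmean path the kernel computes (a sketch for checking the math, not the Triton code):

```python
import torch

eps = 1e-10
y_pred = torch.log_softmax(torch.randn(4, 32), dim=-1)  # log-probabilities, shape (BT, V)
y_true = torch.softmax(torch.randn(4, 32), dim=-1)      # probabilities, shape (BT, V)

# KL(y_true || y_pred) with y_pred already in log-space, as in the kernel comment:
#   loss = y_true * (log(max(y_true, eps)) - y_pred)
loss = y_true * (torch.log(torch.clamp(y_true, min=eps)) - y_pred)

batchmean = loss.sum() / y_true.shape[0]  # "batchmean" divides the total by BT
```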
{liger_kernel-0.3.0 → liger_kernel-0.3.1}/src/liger_kernel/ops/swiglu.py

@@ -14,7 +14,7 @@ def silu(x):
  def _swiglu_forward_kernel(
  a_ptr, b_ptr, c_ptr, stride, n_cols: tl.constexpr, BLOCK_SIZE: tl.constexpr
  ):
- program_id = tl.program_id(0)
+ program_id = tl.program_id(0).cast(tl.int64)

  # locate start index
  a_ptr += program_id * stride
@@ -35,7 +35,7 @@ def _swiglu_forward_kernel(
  def _swiglu_backward_kernel(
  dc_ptr, a_ptr, b_ptr, stride, n_cols: tl.constexpr, BLOCK_SIZE: tl.constexpr
  ):
- program_id = tl.program_id(0)
+ program_id = tl.program_id(0).cast(tl.int64)

  # locate start index
  dc_ptr += program_id * stride
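
Both the geglu and swiglu kernels now cast `tl.program_id(0)` to int64 before multiplying by the row stride. Without the cast, the offset arithmetic can wrap around 32-bit range once `program_id * stride` passes 2**31 - 1, which is reachable with long sequences and wide intermediate sizes. A quick back-of-the-envelope check in plain Python (the sizes are hypothetical):

```python
INT32_MAX = 2**31 - 1  # 2,147,483,647

# Hypothetical workload: two 128k-token sequences (262,144 rows total) through an
# MLP whose per-row stride is 14,336 elements.
n_rows, stride = 262_144, 14_336
largest_offset = (n_rows - 1) * stride

print(f"{largest_offset:,}", largest_offset > INT32_MAX)  # 3,758,082,048 True
```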
{liger_kernel-0.3.0 → liger_kernel-0.3.1}/src/liger_kernel/transformers/auto_model.py

@@ -1,6 +1,11 @@
+ import inspect
+
  from transformers import AutoConfig, AutoModelForCausalLM

- from liger_kernel.transformers.monkey_patch import _apply_liger_kernel
+ from liger_kernel.transformers.monkey_patch import (
+ MODEL_TYPE_TO_APPLY_LIGER_FN,
+ _apply_liger_kernel,
+ )


  def _get_model_config(model_dir, **model_init_kwargs):
@@ -21,13 +26,20 @@ class AutoLigerKernelForCausalLM(AutoModelForCausalLM):
  # Determine the model type and apply the Liger Kernel if applicable
  # Note: _apply_liger_kernel will only pass relevant kwargs to the apply_liger_kernel_to_* function
  model_type = model_config.model_type
+
  _apply_liger_kernel(model_type, **kwargs)

- # Retain only the keyword args present in the model configuration
- for k in list(kwargs.keys()):
- if k not in model_config.__dict__:
- del kwargs[k]
+ # Filter out kwargs that were passed to the apply_liger_* function, which will cause
+ # model initialization errors otherwise
+ apply_fn = MODEL_TYPE_TO_APPLY_LIGER_FN[model_type]
+ apply_fn_signature = inspect.signature(apply_fn)
+
+ applicable_kwargs = {
+ key: value
+ for key, value in kwargs.items()
+ if key not in apply_fn_signature.parameters
+ }

  return super().from_pretrained(
- pretrained_model_name_or_path, *model_args, **kwargs
+ pretrained_model_name_or_path, *model_args, **applicable_kwargs
  )
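
With this change, kwargs that match parameters of the resolved `apply_liger_kernel_to_*` function are consumed by the patching step and stripped before `from_pretrained` runs, instead of being filtered against the model config. A usage sketch (the checkpoint name is illustrative):

```python
from liger_kernel.transformers.auto_model import AutoLigerKernelForCausalLM

# "rope" and "rms_norm" are parameters of apply_liger_kernel_to_llama, so they are
# routed to the patching function and dropped from the kwargs given to transformers.
model = AutoLigerKernelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",  # illustrative checkpoint name
    rope=True,
    rms_norm=True,
)
```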
{liger_kernel-0.3.0 → liger_kernel-0.3.1}/src/liger_kernel/transformers/kl_div.py

@@ -4,10 +4,11 @@ from liger_kernel.ops.kl_div import LigerKLDivLossFunction


  class LigerKLDIVLoss(nn.KLDivLoss):
- def __init__(self, *args, **kwargs):
+ def __init__(self, eps: float = 1e-10, *args, **kwargs):
  super(LigerKLDIVLoss, self).__init__(*args, **kwargs)
+ self.eps = eps

  def forward(self, y_pred, y_true):
  return LigerKLDivLossFunction.apply(
- y_pred, y_true, self.reduction, self.log_target
+ y_pred, y_true, self.reduction, self.log_target, self.eps
  )
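
The module wrapper now takes the same `eps` and forwards it along with `reduction` and `log_target` to the autograd function. A minimal usage sketch, assuming a CUDA device since the loss runs a Triton kernel:

```python
import torch
from liger_kernel.transformers.kl_div import LigerKLDIVLoss

loss_fn = LigerKLDIVLoss(reduction="batchmean", eps=1e-10)

logits = torch.randn(4, 128, device="cuda", requires_grad=True)
y_pred = torch.log_softmax(logits, dim=-1)                          # log-probabilities
y_true = torch.softmax(torch.randn(4, 128, device="cuda"), dim=-1)  # probabilities

loss = loss_fn(y_pred, y_true)  # scalar for "batchmean"
loss.backward()
```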
{liger_kernel-0.3.0 → liger_kernel-0.3.1}/src/liger_kernel/transformers/monkey_patch.py

@@ -1,9 +1,9 @@
  import inspect
  import logging
  from functools import partial
+ from typing import Callable

- from torch import nn
- from transformers import PretrainedConfig, PreTrainedModel
+ from transformers import PreTrainedModel

  from liger_kernel.transformers.cross_entropy import LigerCrossEntropyLoss
  from liger_kernel.transformers.geglu import LigerGEGLUMLP
@@ -25,6 +25,30 @@ from liger_kernel.transformers.swiglu import (
  logger = logging.getLogger(__name__)


+ def _bind_method_to_module(module, method_name: str, new_method: Callable):
+ # Binds a new method to a module instance so that self is passed as the first argument
+ module.__dict__[method_name] = new_method.__get__(module, module.__class__)
+
+
+ def _patch_rms_norm_module(module, offset=0.0, eps=1e-6, casting_mode="llama"):
+ module.offset = offset
+ module.casting_mode = casting_mode
+ module.variance_epsilon = (
+ getattr(module, "variance_epsilon", None) or getattr(module, "eps", None) or eps
+ )
+ _bind_method_to_module(module, "forward", LigerRMSNorm.forward)
+ _bind_method_to_module(module, "extra_repr", LigerRMSNorm.extra_repr)
+
+
+ def _patch_layer_norm_module(module, eps=1e-6):
+ module.variance_epsilon = (
+ getattr(module, "variance_epsilon", None) or getattr(module, "eps", None) or eps
+ )
+ module.hidden_size = module.normalized_shape
+ _bind_method_to_module(module, "forward", LigerLayerNorm.forward)
+ _bind_method_to_module(module, "extra_repr", LigerLayerNorm.extra_repr)
+
+
  def apply_liger_kernel_to_llama(
  rope: bool = True,
  cross_entropy: bool = False,
@@ -69,7 +93,6 @@ def apply_liger_kernel_to_llama(
  if model is not None:
  # The model instance already exists, so we need to additionally patch the
  # instance variables that reference already-instantiated modules (e.g. LlamaRMSNorm or LlamaMLP)
- config: PretrainedConfig = model.config

  if hasattr(model, "model"):
  # The case for LlamaForCausalLM or LlamaForSequenceClassification, for example
@@ -81,22 +104,17 @@ def apply_liger_kernel_to_llama(
  # Direct LlamaModel
  base_model = model

- torch_dtype = config.torch_dtype
  if rms_norm:
- base_model.norm = LigerRMSNorm(
- config.hidden_size, eps=config.rms_norm_eps
- ).to(torch_dtype)
+ _patch_rms_norm_module(base_model.norm)

  for decoder_layer in base_model.layers:
  if swiglu:
- decoder_layer.mlp = LigerSwiGLUMLP(config).to(torch_dtype)
+ _bind_method_to_module(
+ decoder_layer.mlp, "forward", LigerSwiGLUMLP.forward
+ )
  if rms_norm:
- decoder_layer.input_layernorm = LigerRMSNorm(
- config.hidden_size, eps=config.rms_norm_eps
- ).to(torch_dtype)
- decoder_layer.post_attention_layernorm = LigerRMSNorm(
- config.hidden_size, eps=config.rms_norm_eps
- ).to(torch_dtype)
+ _patch_rms_norm_module(decoder_layer.input_layernorm)
+ _patch_rms_norm_module(decoder_layer.post_attention_layernorm)


  def apply_liger_kernel_to_mistral(
@@ -143,7 +161,6 @@ def apply_liger_kernel_to_mistral(
  if model is not None:
  # The model instance already exists, so we need to additionally patch the
  # instance variables that reference already-instantiated modules
- config: PretrainedConfig = model.config

  if hasattr(model, "model"):
  # The case for MistralForCausalLM, MistralForTokenClassification for example
@@ -152,22 +169,17 @@ def apply_liger_kernel_to_mistral(
  # Direct MistralModel
  base_model = model

- torch_dtype = config.torch_dtype
  if rms_norm:
- base_model.norm = LigerRMSNorm(
- config.hidden_size, eps=config.rms_norm_eps
- ).to(torch_dtype)
+ _patch_rms_norm_module(base_model.norm)

  for decoder_layer in base_model.layers:
  if swiglu:
- decoder_layer.mlp = LigerSwiGLUMLP(config).to(torch_dtype)
+ _bind_method_to_module(
+ decoder_layer.mlp, "forward", LigerSwiGLUMLP.forward
+ )
  if rms_norm:
- decoder_layer.input_layernorm = LigerRMSNorm(
- config.hidden_size, eps=config.rms_norm_eps
- ).to(torch_dtype)
- decoder_layer.post_attention_layernorm = LigerRMSNorm(
- config.hidden_size, eps=config.rms_norm_eps
- ).to(torch_dtype)
+ _patch_rms_norm_module(decoder_layer.input_layernorm)
+ _patch_rms_norm_module(decoder_layer.post_attention_layernorm)


  def apply_liger_kernel_to_mixtral(
@@ -214,7 +226,6 @@ def apply_liger_kernel_to_mixtral(
  if model is not None:
  # The model instance already exists, so we need to additionally patch the
  # instance variables that reference already-instantiated modules
- config: PretrainedConfig = model.config

  if hasattr(model, "model"):
  # The case for MixtralForCausalLM, MixtralForTokenClassification for example
@@ -223,29 +234,18 @@ def apply_liger_kernel_to_mixtral(
  # Direct MixtralModel
  base_model = model

- torch_dtype = config.torch_dtype
  if rms_norm:
- base_model.norm = LigerRMSNorm(
- config.hidden_size, eps=config.rms_norm_eps
- ).to(torch_dtype)
+ _patch_rms_norm_module(base_model.norm)

  for decoder_layer in base_model.layers:
  if swiglu:
- block_sparse_moe = decoder_layer.block_sparse_moe
- patched_experts = nn.ModuleList(
- [
- LigerBlockSparseTop2MLP(config)
- for _ in range(block_sparse_moe.num_experts)
- ]
- )
- decoder_layer.block_sparse_moe.experts = patched_experts.to(torch_dtype)
+ for expert in decoder_layer.block_sparse_moe.experts:
+ _bind_method_to_module(
+ expert, "forward", LigerBlockSparseTop2MLP.forward
+ )
  if rms_norm:
- decoder_layer.input_layernorm = LigerRMSNorm(
- config.hidden_size, eps=config.rms_norm_eps
- ).to(torch_dtype)
- decoder_layer.post_attention_layernorm = LigerRMSNorm(
- config.hidden_size, eps=config.rms_norm_eps
- ).to(torch_dtype)
+ _patch_rms_norm_module(decoder_layer.input_layernorm)
+ _patch_rms_norm_module(decoder_layer.post_attention_layernorm)


  def apply_liger_kernel_to_gemma(
@@ -282,6 +282,9 @@ def apply_liger_kernel_to_gemma(
  LigerRMSNormForGemma = partial(
  LigerRMSNorm, offset=1.0, init_fn="zeros", casting_mode="gemma"
  )
+ _patch_rms_norm_module_for_gemma = partial(
+ _patch_rms_norm_module, casting_mode="gemma", offset=1.0
+ )

  if rope:
  modeling_gemma.apply_rotary_pos_emb = liger_rotary_pos_emb
@@ -297,7 +300,6 @@ def apply_liger_kernel_to_gemma(
  if model is not None:
  # The model instance already exists, so we need to additionally patch the
  # instance variables that reference already-instantiated modules
- config: PretrainedConfig = model.config

  if hasattr(model, "model"):
  # The case for GemmaForCausalLM, GemmaForTokenClassification for example
@@ -306,22 +308,17 @@ def apply_liger_kernel_to_gemma(
  # Direct GemmaModel
  base_model = model

- torch_dtype = config.torch_dtype
  if rms_norm:
- base_model.norm = LigerRMSNormForGemma(
- config.hidden_size, eps=config.rms_norm_eps
- ).to(torch_dtype)
+ _patch_rms_norm_module_for_gemma(base_model.norm)

  for decoder_layer in base_model.layers:
  if geglu:
- decoder_layer.mlp = LigerGEGLUMLP(config).to(torch_dtype)
+ _bind_method_to_module(
+ decoder_layer.mlp, "forward", LigerGEGLUMLP.forward
+ )
  if rms_norm:
- decoder_layer.input_layernorm = LigerRMSNormForGemma(
- config.hidden_size, eps=config.rms_norm_eps
- ).to(torch_dtype)
- decoder_layer.post_attention_layernorm = LigerRMSNormForGemma(
- config.hidden_size, eps=config.rms_norm_eps
- ).to(torch_dtype)
+ _patch_rms_norm_module_for_gemma(decoder_layer.input_layernorm)
+ _patch_rms_norm_module_for_gemma(decoder_layer.post_attention_layernorm)


  def apply_liger_kernel_to_gemma2(
@@ -343,10 +340,15 @@ def apply_liger_kernel_to_gemma2(
  model (PreTrainedModel): The model instance to apply Liger kernels to, if the model has already been
  loaded. Default is None.
  """
- print("Got here!")
  from transformers.models.gemma2 import modeling_gemma2

- LigerRMSNormForGemma2 = partial(LigerRMSNorm, offset=1.0, init_fn="zeros")
+ LigerRMSNormForGemma2 = partial(
+ LigerRMSNorm, offset=1.0, casting_mode="gemma", init_fn="zeros"
+ )
+ _patch_rms_norm_module_for_gemma2 = partial(
+ _patch_rms_norm_module, offset=1.0, casting_mode="gemma"
+ )
+
  if rope:
  modeling_gemma2.apply_rotary_pos_emb = liger_rotary_pos_emb
  if rms_norm:
@@ -360,7 +362,6 @@ def apply_liger_kernel_to_gemma2(
  if model is not None:
  # The model instance already exists, so we need to additionally patch the
  # instance variables that reference already-instantiated modules
- config: PretrainedConfig = model.config

  if hasattr(model, "model"):
  # The case for Gemma2ForCausalLM, Gemma2ForTokenClassification for example
@@ -369,28 +370,25 @@ def apply_liger_kernel_to_gemma2(
  # Direct Gemma2Model
  base_model = model

- torch_dtype = config.torch_dtype
  if rms_norm:
- base_model.norm = LigerRMSNormForGemma2(
- config.hidden_size, eps=config.rms_norm_eps
- ).to(torch_dtype)
+ _patch_rms_norm_module_for_gemma2(base_model.norm)

  for decoder_layer in base_model.layers:
  if geglu:
- decoder_layer.mlp = LigerGEGLUMLP(config).to(torch_dtype)
+ _bind_method_to_module(
+ decoder_layer.mlp, "forward", LigerGEGLUMLP.forward
+ )
  if rms_norm:
- decoder_layer.input_layernorm = LigerRMSNormForGemma2(
- config.hidden_size, eps=config.rms_norm_eps
- ).to(torch_dtype)
- decoder_layer.post_attention_layernorm = LigerRMSNormForGemma2(
- config.hidden_size, eps=config.rms_norm_eps
- ).to(torch_dtype)
- decoder_layer.pre_feedforward_layernorm = LigerRMSNormForGemma2(
- config.hidden_size, eps=config.rms_norm_eps
- ).to(torch_dtype)
- decoder_layer.post_feedforward_layernorm = LigerRMSNormForGemma2(
- config.hidden_size, eps=config.rms_norm_eps
- ).to(torch_dtype)
+ _patch_rms_norm_module_for_gemma2(decoder_layer.input_layernorm)
+ _patch_rms_norm_module_for_gemma2(
+ decoder_layer.post_attention_layernorm
+ )
+ _patch_rms_norm_module_for_gemma2(
+ decoder_layer.pre_feedforward_layernorm
+ )
+ _patch_rms_norm_module_for_gemma2(
+ decoder_layer.post_feedforward_layernorm
+ )


  def apply_liger_kernel_to_qwen2(
@@ -436,7 +434,6 @@ def apply_liger_kernel_to_qwen2(
  if model is not None:
  # The model instance already exists, so we need to additionally patch the
  # instance variables that reference already-instantiated modules
- config: PretrainedConfig = model.config

  if hasattr(model, "model"):
  # The case for Qwen2ForCausalLM, Qwen2ForTokenClassification for example
@@ -445,22 +442,17 @@ def apply_liger_kernel_to_qwen2(
  # Direct Qwen2Model
  base_model = model

- torch_dtype = config.torch_dtype
  if rms_norm:
- base_model.norm = LigerRMSNorm(
- config.hidden_size, eps=config.rms_norm_eps
- ).to(torch_dtype)
+ _patch_rms_norm_module(base_model.norm)

  for decoder_layer in base_model.layers:
  if swiglu:
- decoder_layer.mlp = LigerSwiGLUMLP(config).to(torch_dtype)
+ _bind_method_to_module(
+ decoder_layer.mlp, "forward", LigerSwiGLUMLP.forward
+ )
  if rms_norm:
- decoder_layer.input_layernorm = LigerRMSNorm(
- config.hidden_size, eps=config.rms_norm_eps
- ).to(torch_dtype)
- decoder_layer.post_attention_layernorm = LigerRMSNorm(
- config.hidden_size, eps=config.rms_norm_eps
- ).to(torch_dtype)
+ _patch_rms_norm_module(decoder_layer.input_layernorm)
+ _patch_rms_norm_module(decoder_layer.post_attention_layernorm)


  def apply_liger_kernel_to_qwen2_vl(
@@ -499,10 +491,9 @@ def apply_liger_kernel_to_qwen2_vl(

  # TODO: Support Qwen2-VL's multimodal RoPE implementation

- LigerRMSNormForQwen2VL = partial(LigerRMSNorm, init_fn="ones", casting_mode="gemma")
  if rms_norm:
  # https://github.com/huggingface/transformers/blob/main/src/transformers/models/qwen2_vl/modeling_qwen2_vl.py#L439
- modeling_qwen2_vl.Qwen2RMSNorm = LigerRMSNormForQwen2VL
+ modeling_qwen2_vl.Qwen2RMSNorm = LigerRMSNorm
  if layer_norm:
  modeling_qwen2_vl.LayerNorm = LigerLayerNorm
  if cross_entropy:
@@ -515,9 +506,6 @@ def apply_liger_kernel_to_qwen2_vl(
  if model is not None:
  # The model instance already exists, so we need to additionally patch the
  # instance variables that reference already-instantiated modules
- config: PretrainedConfig = model.config
-
- torch_dtype = config.torch_dtype

  if hasattr(model, "model"):
  # The case for Qwen2VLForConditionalGeneration.
@@ -530,27 +518,19 @@ def apply_liger_kernel_to_qwen2_vl(
  # Patch Qwen2VisionTransformerPretrainedModel
  for vision_block in model.visual.blocks:
  if layer_norm:
- vision_block.norm1 = LigerLayerNorm(config.embed_dim, eps=1e-6).to(
- torch_dtype
- )
- vision_block.norm2 = LigerLayerNorm(config.embed_dim, eps=1e-6).to(
- torch_dtype
- )
+ _patch_layer_norm_module(vision_block.norm1)
+ _patch_layer_norm_module(vision_block.norm2)

  if rms_norm:
- base_model.norm = LigerRMSNormForQwen2VL(
- config.hidden_size, eps=config.rms_norm_eps
- ).to(torch_dtype)
+ _patch_rms_norm_module(base_model.norm)
  for decoder_layer in base_model.layers:
  if swiglu:
- decoder_layer.mlp = LigerSwiGLUMLP(config).to(torch_dtype)
+ _bind_method_to_module(
+ decoder_layer.mlp, "forward", LigerSwiGLUMLP.forward
+ )
  if rms_norm:
- decoder_layer.input_layernorm = LigerRMSNormForQwen2VL(
- config.hidden_size, eps=config.rms_norm_eps
- ).to(torch_dtype)
- decoder_layer.post_attention_layernorm = LigerRMSNormForQwen2VL(
- config.hidden_size, eps=config.rms_norm_eps
- ).to(torch_dtype)
+ _patch_rms_norm_module(decoder_layer.input_layernorm)
+ _patch_rms_norm_module(decoder_layer.post_attention_layernorm)


  def apply_liger_kernel_to_phi3(
@@ -596,7 +576,6 @@ def apply_liger_kernel_to_phi3(
  if model is not None:
  # The model instance already exists, so we need to additionally patch the
  # instance variables that reference already-instantiated modules
- config: PretrainedConfig = model.config

  if hasattr(model, "model"):
  # The case for Phi3ForCausalLM, Phi3ForTokenClassification for example
@@ -605,22 +584,17 @@ def apply_liger_kernel_to_phi3(
  # Direct Phi3Model
  base_model = model

- torch_dtype = config.torch_dtype
  if rms_norm:
- base_model.norm = LigerRMSNorm(
- config.hidden_size, eps=config.rms_norm_eps
- ).to(torch_dtype)
+ _patch_rms_norm_module(base_model.norm)

  for decoder_layer in base_model.layers:
  if swiglu:
- decoder_layer.mlp = LigerPhi3SwiGLUMLP(config).to(torch_dtype)
+ _bind_method_to_module(
+ decoder_layer.mlp, "forward", LigerPhi3SwiGLUMLP.forward
+ )
  if rms_norm:
- decoder_layer.input_layernorm = LigerRMSNorm(
- config.hidden_size, eps=config.rms_norm_eps
- ).to(torch_dtype)
- decoder_layer.post_attention_layernorm = LigerRMSNorm(
- config.hidden_size, eps=config.rms_norm_eps
- ).to(torch_dtype)
+ _patch_rms_norm_module(decoder_layer.input_layernorm)
+ _patch_rms_norm_module(decoder_layer.post_attention_layernorm)

  # Model type corresponds to the keys defined in transformers/models/auto/modeling_auto.py
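
The new in-place patching relies on the descriptor protocol: `function.__get__(instance, type)` returns a method bound to that one instance, and writing it into the instance `__dict__` overrides the class-level `forward` for that module only, so existing weights (and their dtype and device) are left untouched instead of being re-created as in the old `LigerRMSNorm(...).to(torch_dtype)` path. A stand-alone sketch of the mechanism with a toy module (not Liger code):

```python
import torch
import torch.nn as nn


def doubled_forward(self, x):
    # Stand-in for e.g. LigerSwiGLUMLP.forward; "self" is the patched instance.
    return self.linear(x) * 2


def bind_method_to_module(module, method_name, new_method):
    # Same idea as _bind_method_to_module: bind the plain function to this instance only.
    module.__dict__[method_name] = new_method.__get__(module, module.__class__)


class ToyMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(4, 4)

    def forward(self, x):
        return self.linear(x)


mlp = ToyMLP()
bind_method_to_module(mlp, "forward", doubled_forward)  # only this instance is affected
out = mlp(torch.randn(2, 4))  # dispatches to doubled_forward via the instance __dict__
```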
{liger_kernel-0.3.0 → liger_kernel-0.3.1/src/liger_kernel.egg-info}/PKG-INFO

@@ -1,6 +1,6 @@
  Metadata-Version: 2.1
  Name: liger_kernel
- Version: 0.3.0
+ Version: 0.3.1
  Summary: Efficient Triton kernels for LLM Training
  License: BSD 2-CLAUSE LICENSE
  Copyright 2024 LinkedIn Corporation
@@ -32,15 +32,16 @@ License-File: LICENSE
  License-File: NOTICE
  Requires-Dist: torch>=2.1.2
  Requires-Dist: triton>=2.3.0
- Requires-Dist: transformers>=4.42.0
+ Provides-Extra: transformers
+ Requires-Dist: transformers~=4.0; extra == "transformers"
  Provides-Extra: dev
+ Requires-Dist: transformers>=4.44.2; extra == "dev"
  Requires-Dist: matplotlib>=3.7.2; extra == "dev"
  Requires-Dist: flake8>=4.0.1.1; extra == "dev"
  Requires-Dist: black>=24.4.2; extra == "dev"
  Requires-Dist: isort>=5.13.2; extra == "dev"
  Requires-Dist: pytest>=7.1.2; extra == "dev"
  Requires-Dist: datasets>=2.19.2; extra == "dev"
- Requires-Dist: jupyter==1.0.0; extra == "dev"
  Requires-Dist: seaborn; extra == "dev"

  # Liger Kernel: Efficient Triton Kernels for LLM Training
@@ -74,8 +75,8 @@ Requires-Dist: seaborn; extra == "dev"
  </a>
  </td>
  <td style="padding: 10px;">
- <a href="https://discord.gg/CX2YmNmn">
- <img src="https://dcbadge.vercel.app/api/server/cudamode?style=flat" alt="Join Our Discord">
+ <a href="https://discord.gg/gpumode">
+ <img src="https://dcbadge.vercel.app/api/server/gpumode?style=flat" alt="Join Our Discord">
  </a>
  </td>
  </tr>
@@ -151,7 +152,10 @@ With one line of code, Liger Kernel can increase throughput by more than 20% and

  - `torch >= 2.1.2`
  - `triton >= 2.3.0`
- - `transformers >= 4.42.0`
+
+ ### Optional Dependencies
+
+ - `transformers >= 4.x`: Required if you plan to use the transformers models patching APIs. The specific model you are working will dictate the minimum version of transformers.

  > **Note:**
  > Our kernels inherit the full spectrum of hardware compatibility offered by [Triton](https://github.com/triton-lang/triton).
@@ -174,7 +178,10 @@ To install from source:
  git clone https://github.com/linkedin/Liger-Kernel.git
  cd Liger-Kernel
  pip install -e .
+ # or if using transformers
+ pip install -e .[transformers]
  ```
+
  ## Getting Started

  There are a couple of ways to apply Liger kernels, depending on the level of customization required.
@@ -271,9 +278,9 @@ loss.backward()
  | Mixtral | `liger_kernel.transformers.apply_liger_kernel_to_mixtral` | RoPE, RMSNorm, SwiGLU, CrossEntropyLoss, FusedLinearCrossEntropy |
  | Gemma1 | `liger_kernel.transformers.apply_liger_kernel_to_gemma` | RoPE, RMSNorm, GeGLU, CrossEntropyLoss, FusedLinearCrossEntropy |
  | Gemma2 | `liger_kernel.transformers.apply_liger_kernel_to_gemma2` | RoPE, RMSNorm, GeGLU, CrossEntropyLoss |
- | Qwen2 | `liger_kernel.transformers.apply_liger_kernel_to_qwen2` | RoPE, RMSNorm, SwiGLU, CrossEntropyLoss, FusedLinearCrossEntropy |
+ | Qwen2 & Qwen2.5 | `liger_kernel.transformers.apply_liger_kernel_to_qwen2` | RoPE, RMSNorm, SwiGLU, CrossEntropyLoss, FusedLinearCrossEntropy |
  | Qwen2-VL | `liger_kernel.transformers.apply_liger_kernel_to_qwen2_vl` | RMSNorm, LayerNorm, SwiGLU, CrossEntropyLoss, FusedLinearCrossEntropy |
- | Phi3 | `liger_kernel.transformers.apply_liger_kernel_to_phi3` | RoPE, RMSNorm, SwiGLU, CrossEntropyLoss, FusedLinearCrossEntropy |
+ | Phi3 & Phi3.5 | `liger_kernel.transformers.apply_liger_kernel_to_phi3` | RoPE, RMSNorm, SwiGLU, CrossEntropyLoss, FusedLinearCrossEntropy |

{liger_kernel-0.3.0 → liger_kernel-0.3.1}/src/liger_kernel.egg-info/requires.txt

@@ -1,13 +1,15 @@
  torch>=2.1.2
  triton>=2.3.0
- transformers>=4.42.0

  [dev]
+ transformers>=4.44.2
  matplotlib>=3.7.2
  flake8>=4.0.1.1
  black>=24.4.2
  isort>=5.13.2
  pytest>=7.1.2
  datasets>=2.19.2
- jupyter==1.0.0
  seaborn
+
+ [transformers]
+ transformers~=4.0
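
Since `transformers` is now an optional extra rather than a hard dependency, the model patching APIs from the supported-models table above only work once it is installed (for example via the new `[transformers]` extra); the kernels themselves need only torch and triton. A sketch of the one-line patching flow, assuming transformers is present and using an illustrative Qwen2 checkpoint:

```python
import torch
from transformers import AutoModelForCausalLM

from liger_kernel.transformers import apply_liger_kernel_to_qwen2

# Patch the Qwen2/Qwen2.5 modeling code (RoPE, RMSNorm, SwiGLU, CrossEntropyLoss, ...)
# before the model is instantiated so the Liger kernels are picked up.
apply_liger_kernel_to_qwen2()

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2-1.5B-Instruct",  # illustrative checkpoint name
    torch_dtype=torch.bfloat16,
)
```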