liger-kernel-nightly 0.4.2.dev20241121054604__tar.gz → 0.4.2.dev20241121225747__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (66)
  1. {liger_kernel_nightly-0.4.2.dev20241121054604/src/liger_kernel_nightly.egg-info → liger_kernel_nightly-0.4.2.dev20241121225747}/PKG-INFO +3 -3
  2. {liger_kernel_nightly-0.4.2.dev20241121054604 → liger_kernel_nightly-0.4.2.dev20241121225747}/README.md +2 -2
  3. {liger_kernel_nightly-0.4.2.dev20241121054604 → liger_kernel_nightly-0.4.2.dev20241121225747}/pyproject.toml +1 -1
  4. {liger_kernel_nightly-0.4.2.dev20241121054604 → liger_kernel_nightly-0.4.2.dev20241121225747}/src/liger_kernel/ops/fused_linear_jsd.py +1 -1
  5. {liger_kernel_nightly-0.4.2.dev20241121054604 → liger_kernel_nightly-0.4.2.dev20241121225747}/src/liger_kernel/ops/jsd.py +19 -10
  6. liger_kernel_nightly-0.4.2.dev20241121225747/src/liger_kernel/transformers/functional.py +173 -0
  7. {liger_kernel_nightly-0.4.2.dev20241121054604 → liger_kernel_nightly-0.4.2.dev20241121225747}/src/liger_kernel/transformers/fused_linear_jsd.py +1 -4
  8. {liger_kernel_nightly-0.4.2.dev20241121054604 → liger_kernel_nightly-0.4.2.dev20241121225747}/src/liger_kernel/transformers/jsd.py +1 -4
  9. {liger_kernel_nightly-0.4.2.dev20241121054604 → liger_kernel_nightly-0.4.2.dev20241121225747/src/liger_kernel_nightly.egg-info}/PKG-INFO +3 -3
  10. liger_kernel_nightly-0.4.2.dev20241121054604/src/liger_kernel/transformers/functional.py +0 -58
  11. {liger_kernel_nightly-0.4.2.dev20241121054604 → liger_kernel_nightly-0.4.2.dev20241121225747}/LICENSE +0 -0
  12. {liger_kernel_nightly-0.4.2.dev20241121054604 → liger_kernel_nightly-0.4.2.dev20241121225747}/NOTICE +0 -0
  13. {liger_kernel_nightly-0.4.2.dev20241121054604 → liger_kernel_nightly-0.4.2.dev20241121225747}/setup.cfg +0 -0
  14. {liger_kernel_nightly-0.4.2.dev20241121054604 → liger_kernel_nightly-0.4.2.dev20241121225747}/src/liger_kernel/chunked_loss/__init__.py +0 -0
  15. {liger_kernel_nightly-0.4.2.dev20241121054604 → liger_kernel_nightly-0.4.2.dev20241121225747}/src/liger_kernel/chunked_loss/cpo_loss.py +0 -0
  16. {liger_kernel_nightly-0.4.2.dev20241121054604 → liger_kernel_nightly-0.4.2.dev20241121225747}/src/liger_kernel/chunked_loss/dpo_loss.py +0 -0
  17. {liger_kernel_nightly-0.4.2.dev20241121054604 → liger_kernel_nightly-0.4.2.dev20241121225747}/src/liger_kernel/chunked_loss/functional.py +0 -0
  18. {liger_kernel_nightly-0.4.2.dev20241121054604 → liger_kernel_nightly-0.4.2.dev20241121225747}/src/liger_kernel/chunked_loss/fused_linear_preference.py +0 -0
  19. {liger_kernel_nightly-0.4.2.dev20241121054604 → liger_kernel_nightly-0.4.2.dev20241121225747}/src/liger_kernel/chunked_loss/orpo_loss.py +0 -0
  20. {liger_kernel_nightly-0.4.2.dev20241121054604 → liger_kernel_nightly-0.4.2.dev20241121225747}/src/liger_kernel/chunked_loss/simpo_loss.py +0 -0
  21. {liger_kernel_nightly-0.4.2.dev20241121054604 → liger_kernel_nightly-0.4.2.dev20241121225747}/src/liger_kernel/env_report.py +0 -0
  22. {liger_kernel_nightly-0.4.2.dev20241121054604 → liger_kernel_nightly-0.4.2.dev20241121225747}/src/liger_kernel/ops/__init__.py +0 -0
  23. {liger_kernel_nightly-0.4.2.dev20241121054604 → liger_kernel_nightly-0.4.2.dev20241121225747}/src/liger_kernel/ops/cross_entropy.py +0 -0
  24. {liger_kernel_nightly-0.4.2.dev20241121054604 → liger_kernel_nightly-0.4.2.dev20241121225747}/src/liger_kernel/ops/experimental/embedding.py +0 -0
  25. {liger_kernel_nightly-0.4.2.dev20241121054604 → liger_kernel_nightly-0.4.2.dev20241121225747}/src/liger_kernel/ops/experimental/mm_int8int2.py +0 -0
  26. {liger_kernel_nightly-0.4.2.dev20241121054604 → liger_kernel_nightly-0.4.2.dev20241121225747}/src/liger_kernel/ops/fused_linear_cross_entropy.py +0 -0
  27. {liger_kernel_nightly-0.4.2.dev20241121054604 → liger_kernel_nightly-0.4.2.dev20241121225747}/src/liger_kernel/ops/geglu.py +0 -0
  28. {liger_kernel_nightly-0.4.2.dev20241121054604 → liger_kernel_nightly-0.4.2.dev20241121225747}/src/liger_kernel/ops/group_norm.py +0 -0
  29. {liger_kernel_nightly-0.4.2.dev20241121054604 → liger_kernel_nightly-0.4.2.dev20241121225747}/src/liger_kernel/ops/kl_div.py +0 -0
  30. {liger_kernel_nightly-0.4.2.dev20241121054604 → liger_kernel_nightly-0.4.2.dev20241121225747}/src/liger_kernel/ops/layer_norm.py +0 -0
  31. {liger_kernel_nightly-0.4.2.dev20241121054604 → liger_kernel_nightly-0.4.2.dev20241121225747}/src/liger_kernel/ops/qwen2vl_mrope.py +0 -0
  32. {liger_kernel_nightly-0.4.2.dev20241121054604 → liger_kernel_nightly-0.4.2.dev20241121225747}/src/liger_kernel/ops/rms_norm.py +0 -0
  33. {liger_kernel_nightly-0.4.2.dev20241121054604 → liger_kernel_nightly-0.4.2.dev20241121225747}/src/liger_kernel/ops/rope.py +0 -0
  34. {liger_kernel_nightly-0.4.2.dev20241121054604 → liger_kernel_nightly-0.4.2.dev20241121225747}/src/liger_kernel/ops/swiglu.py +0 -0
  35. {liger_kernel_nightly-0.4.2.dev20241121054604 → liger_kernel_nightly-0.4.2.dev20241121225747}/src/liger_kernel/ops/utils.py +0 -0
  36. {liger_kernel_nightly-0.4.2.dev20241121054604 → liger_kernel_nightly-0.4.2.dev20241121225747}/src/liger_kernel/transformers/__init__.py +0 -0
  37. {liger_kernel_nightly-0.4.2.dev20241121054604 → liger_kernel_nightly-0.4.2.dev20241121225747}/src/liger_kernel/transformers/auto_model.py +0 -0
  38. {liger_kernel_nightly-0.4.2.dev20241121054604 → liger_kernel_nightly-0.4.2.dev20241121225747}/src/liger_kernel/transformers/cross_entropy.py +0 -0
  39. {liger_kernel_nightly-0.4.2.dev20241121054604 → liger_kernel_nightly-0.4.2.dev20241121225747}/src/liger_kernel/transformers/experimental/embedding.py +0 -0
  40. {liger_kernel_nightly-0.4.2.dev20241121054604 → liger_kernel_nightly-0.4.2.dev20241121225747}/src/liger_kernel/transformers/fused_linear_cross_entropy.py +0 -0
  41. {liger_kernel_nightly-0.4.2.dev20241121054604 → liger_kernel_nightly-0.4.2.dev20241121225747}/src/liger_kernel/transformers/geglu.py +0 -0
  42. {liger_kernel_nightly-0.4.2.dev20241121054604 → liger_kernel_nightly-0.4.2.dev20241121225747}/src/liger_kernel/transformers/group_norm.py +0 -0
  43. {liger_kernel_nightly-0.4.2.dev20241121054604 → liger_kernel_nightly-0.4.2.dev20241121225747}/src/liger_kernel/transformers/kl_div.py +0 -0
  44. {liger_kernel_nightly-0.4.2.dev20241121054604 → liger_kernel_nightly-0.4.2.dev20241121225747}/src/liger_kernel/transformers/layer_norm.py +0 -0
  45. {liger_kernel_nightly-0.4.2.dev20241121054604 → liger_kernel_nightly-0.4.2.dev20241121225747}/src/liger_kernel/transformers/model/__init__.py +0 -0
  46. {liger_kernel_nightly-0.4.2.dev20241121054604 → liger_kernel_nightly-0.4.2.dev20241121225747}/src/liger_kernel/transformers/model/gemma.py +0 -0
  47. {liger_kernel_nightly-0.4.2.dev20241121054604 → liger_kernel_nightly-0.4.2.dev20241121225747}/src/liger_kernel/transformers/model/gemma2.py +0 -0
  48. {liger_kernel_nightly-0.4.2.dev20241121054604 → liger_kernel_nightly-0.4.2.dev20241121225747}/src/liger_kernel/transformers/model/llama.py +0 -0
  49. {liger_kernel_nightly-0.4.2.dev20241121054604 → liger_kernel_nightly-0.4.2.dev20241121225747}/src/liger_kernel/transformers/model/mistral.py +0 -0
  50. {liger_kernel_nightly-0.4.2.dev20241121054604 → liger_kernel_nightly-0.4.2.dev20241121225747}/src/liger_kernel/transformers/model/mixtral.py +0 -0
  51. {liger_kernel_nightly-0.4.2.dev20241121054604 → liger_kernel_nightly-0.4.2.dev20241121225747}/src/liger_kernel/transformers/model/mllama.py +0 -0
  52. {liger_kernel_nightly-0.4.2.dev20241121054604 → liger_kernel_nightly-0.4.2.dev20241121225747}/src/liger_kernel/transformers/model/phi3.py +0 -0
  53. {liger_kernel_nightly-0.4.2.dev20241121054604 → liger_kernel_nightly-0.4.2.dev20241121225747}/src/liger_kernel/transformers/model/qwen2.py +0 -0
  54. {liger_kernel_nightly-0.4.2.dev20241121054604 → liger_kernel_nightly-0.4.2.dev20241121225747}/src/liger_kernel/transformers/model/qwen2_vl.py +0 -0
  55. {liger_kernel_nightly-0.4.2.dev20241121054604 → liger_kernel_nightly-0.4.2.dev20241121225747}/src/liger_kernel/transformers/monkey_patch.py +0 -0
  56. {liger_kernel_nightly-0.4.2.dev20241121054604 → liger_kernel_nightly-0.4.2.dev20241121225747}/src/liger_kernel/transformers/qwen2vl_mrope.py +0 -0
  57. {liger_kernel_nightly-0.4.2.dev20241121054604 → liger_kernel_nightly-0.4.2.dev20241121225747}/src/liger_kernel/transformers/rms_norm.py +0 -0
  58. {liger_kernel_nightly-0.4.2.dev20241121054604 → liger_kernel_nightly-0.4.2.dev20241121225747}/src/liger_kernel/transformers/rope.py +0 -0
  59. {liger_kernel_nightly-0.4.2.dev20241121054604 → liger_kernel_nightly-0.4.2.dev20241121225747}/src/liger_kernel/transformers/swiglu.py +0 -0
  60. {liger_kernel_nightly-0.4.2.dev20241121054604 → liger_kernel_nightly-0.4.2.dev20241121225747}/src/liger_kernel/transformers/trainer_integration.py +0 -0
  61. {liger_kernel_nightly-0.4.2.dev20241121054604 → liger_kernel_nightly-0.4.2.dev20241121225747}/src/liger_kernel/triton/__init__.py +0 -0
  62. {liger_kernel_nightly-0.4.2.dev20241121054604 → liger_kernel_nightly-0.4.2.dev20241121225747}/src/liger_kernel/triton/monkey_patch.py +0 -0
  63. {liger_kernel_nightly-0.4.2.dev20241121054604 → liger_kernel_nightly-0.4.2.dev20241121225747}/src/liger_kernel_nightly.egg-info/SOURCES.txt +0 -0
  64. {liger_kernel_nightly-0.4.2.dev20241121054604 → liger_kernel_nightly-0.4.2.dev20241121225747}/src/liger_kernel_nightly.egg-info/dependency_links.txt +0 -0
  65. {liger_kernel_nightly-0.4.2.dev20241121054604 → liger_kernel_nightly-0.4.2.dev20241121225747}/src/liger_kernel_nightly.egg-info/requires.txt +0 -0
  66. {liger_kernel_nightly-0.4.2.dev20241121054604 → liger_kernel_nightly-0.4.2.dev20241121225747}/src/liger_kernel_nightly.egg-info/top_level.txt +0 -0
File: PKG-INFO

@@ -1,6 +1,6 @@
  Metadata-Version: 2.1
  Name: liger_kernel_nightly
- Version: 0.4.2.dev20241121054604
+ Version: 0.4.2.dev20241121225747
  Summary: Efficient Triton kernels for LLM Training
  License: BSD 2-CLAUSE LICENSE
  Copyright 2024 LinkedIn Corporation
@@ -303,8 +303,8 @@ $$\text{GeGLU}(x)=\text{GELU}(xW+b)\otimes(xV+c)$$
  <!-- TODO: verify vocab sizes are accurate -->
  - **FusedLinearCrossEntropy**: Peak memory usage of cross entropy loss is further improved by fusing the model head with the CE loss and chunking the input for block-wise loss and gradient calculation, a technique inspired by [Efficient Cross Entropy](https://github.com/mgmalek/efficient_cross_entropy). It achieves >4X memory reduction for 128k vocab size. **This is highly effective for large batch size, large sequence length, and large vocabulary sizes.** Please refer to the [Medusa example](https://github.com/linkedin/Liger-Kernel/tree/main/examples/medusa) for individual kernel usage.
  - **KLDivergence**: [KL Divergence](https://pytorch.org/docs/stable/generated/torch.nn.KLDivLoss.html) is implemented by fusing the forward into a single triton kernel, with reduction done outside the kernel. It achieves ~1.5X speed and ~15% memory reduction for 128K vocab size.
- - **JSD**: [Generalized JSD](https://arxiv.org/pdf/2306.13649) (Jensen-Shannon divergence), is implemented by computing both the loss and gradient in the forward pass. It achieves ~1.5X speed and ~54% memory reduction for 128k vocab size.
- - **FusedLinearJSD**: Peak memory usage of JSD loss is further improved by fusing the model head with the JSD and chunking the input for block-wise loss and gradient calculation. It achieves ~85% memory reduction for 128k vocab size where batch size $\times$ sequence length is 8192.
+ - **JSD**: [Generalized JSD](https://arxiv.org/pdf/2306.13649) (Jensen-Shannon divergence), is implemented by computing both the loss and gradient in the forward pass. It achieves ~1.5X speed and ~54% memory reduction for 128k vocab size. **NOTE**: It implements forward/reverse KL when `beta` equals 0 and 1 respectively.
+ - **FusedLinearJSD**: Peak memory usage of JSD loss is further improved by fusing the model head with the JSD and chunking the input for block-wise loss and gradient calculation. It achieves ~85% memory reduction for 128k vocab size where batch size $\times$ sequence length is 8192. **NOTE**: It implements forward/reverse KL when `beta` equals 0 and 1 respectively.


  ### Experimental Kernels
File: README.md

@@ -256,8 +256,8 @@ $$\text{GeGLU}(x)=\text{GELU}(xW+b)\otimes(xV+c)$$
  <!-- TODO: verify vocab sizes are accurate -->
  - **FusedLinearCrossEntropy**: Peak memory usage of cross entropy loss is further improved by fusing the model head with the CE loss and chunking the input for block-wise loss and gradient calculation, a technique inspired by [Efficient Cross Entropy](https://github.com/mgmalek/efficient_cross_entropy). It achieves >4X memory reduction for 128k vocab size. **This is highly effective for large batch size, large sequence length, and large vocabulary sizes.** Please refer to the [Medusa example](https://github.com/linkedin/Liger-Kernel/tree/main/examples/medusa) for individual kernel usage.
  - **KLDivergence**: [KL Divergence](https://pytorch.org/docs/stable/generated/torch.nn.KLDivLoss.html) is implemented by fusing the forward into a single triton kernel, with reduction done outside the kernel. It achieves ~1.5X speed and ~15% memory reduction for 128K vocab size.
- - **JSD**: [Generalized JSD](https://arxiv.org/pdf/2306.13649) (Jensen-Shannon divergence), is implemented by computing both the loss and gradient in the forward pass. It achieves ~1.5X speed and ~54% memory reduction for 128k vocab size.
- - **FusedLinearJSD**: Peak memory usage of JSD loss is further improved by fusing the model head with the JSD and chunking the input for block-wise loss and gradient calculation. It achieves ~85% memory reduction for 128k vocab size where batch size $\times$ sequence length is 8192.
+ - **JSD**: [Generalized JSD](https://arxiv.org/pdf/2306.13649) (Jensen-Shannon divergence), is implemented by computing both the loss and gradient in the forward pass. It achieves ~1.5X speed and ~54% memory reduction for 128k vocab size. **NOTE**: It implements forward/reverse KL when `beta` equals 0 and 1 respectively.
+ - **FusedLinearJSD**: Peak memory usage of JSD loss is further improved by fusing the model head with the JSD and chunking the input for block-wise loss and gradient calculation. It achieves ~85% memory reduction for 128k vocab size where batch size $\times$ sequence length is 8192. **NOTE**: It implements forward/reverse KL when `beta` equals 0 and 1 respectively.


  ### Experimental Kernels
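For clarity on the new NOTE (this summary is ours, following the generalized JSD formulation in the linked paper, with $P$ the teacher distribution and $Q$ the student distribution):

$$M = \beta P + (1-\beta)\,Q, \qquad \mathrm{JSD}_\beta(P\,\|\,Q) = \beta\,\mathrm{KL}(P\,\|\,M) + (1-\beta)\,\mathrm{KL}(Q\,\|\,M)$$

At the endpoints this mixture form collapses to zero, so the kernels special-case them: `beta = 0` computes the forward KL $\mathrm{KL}(P\,\|\,Q)$ and `beta = 1` computes the reverse KL $\mathrm{KL}(Q\,\|\,P)$, which is what the `src/liger_kernel/ops/jsd.py` hunk further down implements.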
File: pyproject.toml

@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"

  [project]
  name = "liger_kernel_nightly"
- version = "0.4.2.dev20241121054604"
+ version = "0.4.2.dev20241121225747"
  description = "Efficient Triton kernels for LLM Training"
  urls = { "Homepage" = "https://github.com/linkedin/Liger-Kernel" }
  readme = { file = "README.md", content-type = "text/markdown" }
File: src/liger_kernel/ops/fused_linear_jsd.py

@@ -202,7 +202,7 @@ class LigerFusedLinearJSDFunction(torch.autograd.Function):
  teacher_input (torch.tensor): input of the last projection layer in teacher model, with shape (B*T, H), where B is batch size, T is sequence length, H is hidden dimension.
  teacher_weight (torch.tensor): the last projection layer in teacher model, with shape (V, H), where V is vocab size
  shift_labels (Optional[torch.LongTensor]): indicator of next predicted vocab with shape (BT) where each value is in [0, V-1].
- jsd_beta (float): coefficient beta of generalized JSD in the open interval (0, 1). Default: `0.5`
+ jsd_beta (float): coefficient beta of generalized JSD in the interval [0, 1]. It implements forward/reverse KL when beta equals 0 and 1 respectively. Default: `0.5`
  ignore_index (int): the index to ignore. Default: -100
  temperature (float): temperature in softmax function to control the output probability distribution. Default: `1.0`

File: src/liger_kernel/ops/jsd.py

@@ -18,7 +18,7 @@ def _jsd_kernel(
      dX_ptr,
      dX_stride,
      label_ptr,
-     beta,
+     beta: tl.constexpr,
      n_non_ignore: int,
      ignore_index: tl.constexpr,
      n_cols,
@@ -50,17 +50,26 @@ def _jsd_kernel(
      X = tl.load(X_ptr + offsets, mask=mask, other=float("-inf")).to(tl.float32)
      Y = tl.load(Y_ptr + offsets, mask=mask, other=float("-inf")).to(tl.float32)

-     Q = tl.exp(X)
-     P = tl.exp(Y)
-     M = beta * P + (1 - beta) * Q
-     log_M = tl.log(M)
+     if beta == 0.0: # forward KL
+         Y_prob = tl.exp(Y)
+         loss = Y_prob * (Y - X)
+         dX = -Y_prob
+     elif beta == 1.0:
+         X_prob = tl.exp(X)
+         loss = X_prob * (X - Y)
+         dX = loss + X_prob
+     else:
+         Q = tl.exp(X)
+         P = tl.exp(Y)
+         M = beta * P + (1 - beta) * Q
+         log_M = tl.log(M)
+
+         loss = beta * P * Y + (1 - beta) * Q * X - M * log_M
+         dX = (1 - beta) * Q * (X - log_M)

-     loss = beta * P * Y + (1 - beta) * Q * X - M * log_M
-     # reduction == "batchmean"
      loss = loss / n_non_ignore
+     dX = dX / n_non_ignore
      tl.store(loss_ptr + offsets, loss, mask=mask)
-
-     dX = (1 - beta) * Q * (X - log_M) / n_non_ignore
      tl.store(dX_ptr + offsets, dX, mask=mask)


@@ -142,7 +151,7 @@ class LigerJSDFunction(torch.autograd.Function):
  _input (torch.Tensor): predict values with shape (BT, V) in logspace
  target (torch.Tensor): ground truth values with shape (BT, V) in logspace
  shift_labels (Optional[torch.LongTensor]): indicator of next predicted vocab with shape (BT) where each value is in [0, V-1].
- beta (float): coefficient beta of generalized JSD in the open interval (0, 1)
+ beta (float): coefficient beta of generalized JSD in the interval [0, 1]. It implements forward/reverse KL when beta equals 0 and 1 respectively. Default: `0.5`
  ignore_index (int): the index to ignore. Default: -100

  Returns:
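To make the new branches concrete, here is a minimal plain-PyTorch sketch (ours, not part of the package) of the per-row math the updated `_jsd_kernel` computes. It omits the `ignore_index`/`shift_labels` handling and the chunked Triton execution, and assumes `X` and `Y` are student and teacher log-probabilities of shape `(BT, V)`:

```python
import torch


def jsd_reference(X: torch.Tensor, Y: torch.Tensor, beta: float = 0.5) -> torch.Tensor:
    """Reference for the loss _jsd_kernel stores: X = student log-probs, Y = teacher log-probs."""
    if beta == 0.0:  # forward KL: KL(teacher || student)
        loss = Y.exp() * (Y - X)
    elif beta == 1.0:  # reverse KL: KL(student || teacher)
        loss = X.exp() * (X - Y)
    else:  # generalized JSD: beta * KL(P || M) + (1 - beta) * KL(Q || M)
        P, Q = Y.exp(), X.exp()
        M = beta * P + (1 - beta) * Q
        loss = beta * P * Y + (1 - beta) * Q * X - M * M.log()
    return loss.sum() / X.shape[0]  # "batchmean": divide by the number of rows (n_non_ignore)
```

The `loss / n_non_ignore` line and the new `dX / n_non_ignore` line in the hunk apply the same batchmean scaling to both the loss and the gradient that the kernel writes out in the same pass.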
File: src/liger_kernel/transformers/functional.py (new file)

@@ -0,0 +1,173 @@
+ from typing import Optional
+
+ from liger_kernel.ops.cross_entropy import LigerCrossEntropyFunction
+ from liger_kernel.ops.fused_linear_cross_entropy import (
+     LigerFusedLinearCrossEntropyFunction,
+ )
+ from liger_kernel.ops.fused_linear_jsd import LigerFusedLinearJSDFunction
+ from liger_kernel.ops.geglu import LigerGELUMulFunction
+ from liger_kernel.ops.group_norm import LigerGroupNormFunction
+ from liger_kernel.ops.jsd import LigerJSDFunction
+ from liger_kernel.ops.kl_div import LigerKLDivLossFunction
+ from liger_kernel.ops.layer_norm import LigerLayerNormFunction
+ from liger_kernel.ops.qwen2vl_mrope import LigerQwen2VLMRopeFunction
+ from liger_kernel.ops.rms_norm import LigerRMSNormFunction
+ from liger_kernel.ops.rope import LigerRopeFunction
+ from liger_kernel.ops.swiglu import LigerSiLUMulFunction
+
+
+ # conform to the function signature in https://pytorch.org/docs/stable/generated/torch.nn.functional.cross_entropy.html
+ # `weight` and `size_average` are placeholders and not implemented yet
+ def liger_cross_entropy(
+     input,
+     target,
+     weight=None,
+     size_average=None,
+     ignore_index: int = -100,
+     reduce=None,
+     reduction: str = "mean",
+     label_smoothing: float = 0.0,
+     lse_square_scale: float = 0.0,
+     softcap: Optional[float] = None,
+     return_z_loss: bool = False,
+ ):
+     loss, z_loss = LigerCrossEntropyFunction.apply(
+         input,
+         target,
+         ignore_index,
+         lse_square_scale,
+         label_smoothing,
+         reduction,
+         softcap,
+         return_z_loss,
+     )
+     if not return_z_loss:
+         return loss
+     return loss, z_loss
+
+
+ def liger_fused_linear_cross_entropy(
+     input,
+     weight,
+     target,
+     bias=None,
+     ignore_index: int = -100,
+     lse_square_scale: float = 0.0,
+     label_smoothing: float = 0.0,
+     reduction: str = "mean",
+     softcap: Optional[float] = None,
+ ):
+     return LigerFusedLinearCrossEntropyFunction.apply(
+         input,
+         weight,
+         target,
+         bias,
+         ignore_index,
+         lse_square_scale,
+         label_smoothing,
+         reduction,
+         softcap,
+     )
+
+
+ def liger_fused_linear_jsd(
+     student_input,
+     student_weight,
+     teacher_input,
+     teacher_weight,
+     shift_labels=None,
+     jsd_beta: float = 0.5,
+     ignore_index: int = -100,
+     temperature: float = 1.0,
+ ):
+     return LigerFusedLinearJSDFunction.apply(
+         student_input,
+         student_weight,
+         teacher_input,
+         teacher_weight,
+         shift_labels,
+         jsd_beta,
+         ignore_index,
+         temperature,
+     )
+
+
+ def liger_geglu(a, b):
+     return LigerGELUMulFunction.apply(a, b)
+
+
+ def liger_group_norm(
+     X,
+     affine_scaling_weight,
+     affine_shifting_bias,
+     num_channels,
+     num_groups,
+     eps,
+ ):
+     return LigerGroupNormFunction.apply(
+         X,
+         affine_scaling_weight,
+         affine_shifting_bias,
+         num_channels,
+         num_groups,
+         eps,
+     )
+
+
+ def liger_jsd(
+     input,
+     target,
+     shift_labels=None,
+     beta: float = 0.5,
+     ignore_index: int = -100,
+ ):
+     return LigerJSDFunction.apply(
+         input,
+         target,
+         shift_labels,
+         beta,
+         ignore_index,
+     )
+
+
+ # conform to the function signature in https://pytorch.org/docs/stable/generated/torch.nn.functional.kl_div.html#torch.nn.functional.kl_div
+ # `size_average` and `mean` are being deprecated in torch API and are placeholders here
+ def liger_kl_div(
+     input,
+     target,
+     size_average: bool = True,
+     reduce: bool = True,
+     reduction: str = "mean",
+     log_target: bool = False,
+     eps: float = 1e-10,
+ ):
+     # Note: the default reduction in torch is `mean`, but being `batchmean` in Liger
+     return LigerKLDivLossFunction.apply(
+         input,
+         target,
+         reduction,
+         log_target,
+         eps,
+     )
+
+
+ def liger_layer_norm(X, W, B, eps):
+     return LigerLayerNormFunction.apply(X, W, B, eps)
+
+
+ def liger_qwen2vl_mrope(q, k, cos, sin, mrope_section, unsqueeze_dim=1):
+     return LigerQwen2VLMRopeFunction.apply(q, k, cos, sin, mrope_section, unsqueeze_dim)
+
+
+ def liger_rms_norm(
+     X, W, eps, offset: float = 0.0, casting_mode: str = "llama", in_place: bool = True
+ ):
+     return LigerRMSNormFunction.apply(X, W, eps, offset, casting_mode, in_place)
+
+
+ def liger_rope(q, k, cos, sin, position_ids=None, unsqueeze_dim=1):
+     return LigerRopeFunction.apply(q, k, cos, sin, position_ids, unsqueeze_dim)
+
+
+ def liger_swiglu(a, b):
+     return LigerSiLUMulFunction.apply(a, b)
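The rewritten `functional.py` replaces the bare `Function.apply` aliases (removed at the bottom of this diff) with plain Python wrappers that carry named parameters and defaults. A small usage sketch (ours; it assumes a CUDA device with Triton available and uses toy tensor sizes):

```python
import torch

from liger_kernel.transformers.functional import liger_jsd

BT, V = 8, 32  # toy sizes; the kernels target much larger vocab sizes
student_log_probs = torch.randn(BT, V, device="cuda").log_softmax(dim=-1)
teacher_log_probs = torch.randn(BT, V, device="cuda").log_softmax(dim=-1)

# Defaults (shift_labels=None, ignore_index=-100) now come from the wrapper;
# beta=0.0 selects pure forward KL, which the updated kernel supports.
loss = liger_jsd(student_log_probs, teacher_log_probs, beta=0.0)
```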
File: src/liger_kernel/transformers/fused_linear_jsd.py

@@ -12,7 +12,7 @@ class LigerFusedLinearJSD(torch.nn.Module):
  the materialization of the large logits tensor.

  Args:
- jsd_beta (float): coefficient beta of generalized JSD in the open interval (0, 1). Default: `0.5`
+ jsd_beta (float): coefficient beta of generalized JSD in the interval [0, 1]. It implements forward/reverse KL when beta equals 0 and 1 respectively. Default: `0.5`
  ignore_index (int): The index to ignore in the target. Default: `-100`
  temperature (float): temperature in softmax function to control the output probability distribution. Default: `1.0`

@@ -70,9 +70,6 @@ class LigerFusedLinearJSD(torch.nn.Module):

      def __init__(self, jsd_beta=0.5, ignore_index=-100, temperature=1.0):
          super().__init__()
-         assert (
-             jsd_beta > 0 and jsd_beta < 1
-         ), f"beta must be greater than 0 and less than 1. Got: {jsd_beta}"
          assert temperature != 0, "temperature cannot be 0."
          self.jsd_beta = jsd_beta
          self.temperature = temperature
File: src/liger_kernel/transformers/jsd.py

@@ -18,7 +18,7 @@ class LigerJSD(torch.nn.Module):
  :math:`P` denotes the teacher model and :math:`Q` denotes the student model.

  Args:
- beta (float): coefficient beta of generalized JSD in the open interval (0, 1). Default: `0.5`
+ beta (float): coefficient beta of generalized JSD in the interval [0, 1]. It implements forward/reverse KL when beta equals 0 and 1 respectively. Default: `0.5`
  ignore_index (int): The index to ignore in the target. Default: `-100`

  Shape:
@@ -58,9 +58,6 @@ class LigerJSD(torch.nn.Module):

      def __init__(self, beta: float = 0.5, ignore_index: int = -100):
          super().__init__()
-         assert (
-             beta > 0 and beta < 1
-         ), f"beta must be greater than 0 and less than 1. Got: {beta}"
          self.beta = beta
          self.ignore_index = ignore_index

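With the `0 < beta < 1` asserts removed here and in `LigerFusedLinearJSD` above, the module wrappers now accept the endpoint values directly. A hedged sketch (ours; the `forward` argument order, student log-probs then teacher log-probs, is assumed from the class docstring and should be checked against the installed release):

```python
import torch

from liger_kernel.transformers.jsd import LigerJSD

jsd = LigerJSD(beta=1.0)  # reverse KL; this would previously have tripped the removed assert

BT, V = 8, 32  # toy sizes
log_q = torch.randn(BT, V, device="cuda").log_softmax(dim=-1)  # student log-probs
log_p = torch.randn(BT, V, device="cuda").log_softmax(dim=-1)  # teacher log-probs

loss = jsd(log_q, log_p)  # assumed argument order: student first, then teacher
```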
File: src/liger_kernel_nightly.egg-info/PKG-INFO

@@ -1,6 +1,6 @@
  Metadata-Version: 2.1
  Name: liger_kernel_nightly
- Version: 0.4.2.dev20241121054604
+ Version: 0.4.2.dev20241121225747
  Summary: Efficient Triton kernels for LLM Training
  License: BSD 2-CLAUSE LICENSE
  Copyright 2024 LinkedIn Corporation
@@ -303,8 +303,8 @@ $$\text{GeGLU}(x)=\text{GELU}(xW+b)\otimes(xV+c)$$
  <!-- TODO: verify vocab sizes are accurate -->
  - **FusedLinearCrossEntropy**: Peak memory usage of cross entropy loss is further improved by fusing the model head with the CE loss and chunking the input for block-wise loss and gradient calculation, a technique inspired by [Efficient Cross Entropy](https://github.com/mgmalek/efficient_cross_entropy). It achieves >4X memory reduction for 128k vocab size. **This is highly effective for large batch size, large sequence length, and large vocabulary sizes.** Please refer to the [Medusa example](https://github.com/linkedin/Liger-Kernel/tree/main/examples/medusa) for individual kernel usage.
  - **KLDivergence**: [KL Divergence](https://pytorch.org/docs/stable/generated/torch.nn.KLDivLoss.html) is implemented by fusing the forward into a single triton kernel, with reduction done outside the kernel. It achieves ~1.5X speed and ~15% memory reduction for 128K vocab size.
- - **JSD**: [Generalized JSD](https://arxiv.org/pdf/2306.13649) (Jensen-Shannon divergence), is implemented by computing both the loss and gradient in the forward pass. It achieves ~1.5X speed and ~54% memory reduction for 128k vocab size.
- - **FusedLinearJSD**: Peak memory usage of JSD loss is further improved by fusing the model head with the JSD and chunking the input for block-wise loss and gradient calculation. It achieves ~85% memory reduction for 128k vocab size where batch size $\times$ sequence length is 8192.
+ - **JSD**: [Generalized JSD](https://arxiv.org/pdf/2306.13649) (Jensen-Shannon divergence), is implemented by computing both the loss and gradient in the forward pass. It achieves ~1.5X speed and ~54% memory reduction for 128k vocab size. **NOTE**: It implements forward/reverse KL when `beta` equals 0 and 1 respectively.
+ - **FusedLinearJSD**: Peak memory usage of JSD loss is further improved by fusing the model head with the JSD and chunking the input for block-wise loss and gradient calculation. It achieves ~85% memory reduction for 128k vocab size where batch size $\times$ sequence length is 8192. **NOTE**: It implements forward/reverse KL when `beta` equals 0 and 1 respectively.


  ### Experimental Kernels
File: src/liger_kernel/transformers/functional.py (old version, removed)

@@ -1,58 +0,0 @@
- from typing import Optional
-
- from liger_kernel.ops.cross_entropy import LigerCrossEntropyFunction
- from liger_kernel.ops.fused_linear_cross_entropy import (
-     LigerFusedLinearCrossEntropyFunction,
- )
- from liger_kernel.ops.fused_linear_jsd import LigerFusedLinearJSDFunction
- from liger_kernel.ops.geglu import LigerGELUMulFunction
- from liger_kernel.ops.group_norm import LigerGroupNormFunction
- from liger_kernel.ops.jsd import LigerJSDFunction
- from liger_kernel.ops.kl_div import LigerKLDivLossFunction
- from liger_kernel.ops.layer_norm import LigerLayerNormFunction
- from liger_kernel.ops.qwen2vl_mrope import LigerQwen2VLMRopeFunction
- from liger_kernel.ops.rms_norm import LigerRMSNormFunction
- from liger_kernel.ops.rope import LigerRopeFunction
- from liger_kernel.ops.swiglu import LigerSiLUMulFunction
-
- liger_swiglu = LigerSiLUMulFunction.apply
- liger_fused_linear_cross_entropy = LigerFusedLinearCrossEntropyFunction.apply
- liger_geglu = LigerGELUMulFunction.apply
- liger_rms_norm = LigerRMSNormFunction.apply
- liger_rope = LigerRopeFunction.apply
- liger_qwen2vl_mrope = LigerQwen2VLMRopeFunction.apply
- liger_layer_norm = LigerLayerNormFunction.apply
- liger_kl_div = LigerKLDivLossFunction.apply
- liger_jsd = LigerJSDFunction.apply
- liger_fused_linear_jsd = LigerFusedLinearJSDFunction.apply
- liger_group_norm = LigerGroupNormFunction.apply
-
-
- # conform to the function signature in https://pytorch.org/docs/stable/generated/torch.nn.functional.cross_entropy.html
- # `weight` and `size_average` are placeholders and not implemented yet
- def liger_cross_entropy(
-     input,
-     target,
-     weight=None,
-     size_average=None,
-     ignore_index: int = -100,
-     reduce=None,
-     reduction: str = "mean",
-     label_smoothing: float = 0.0,
-     lse_square_scale: float = 0.0,
-     softcap: Optional[float] = None,
-     return_z_loss: bool = False,
- ):
-     loss, z_loss = LigerCrossEntropyFunction.apply(
-         input,
-         target,
-         ignore_index,
-         lse_square_scale,
-         label_smoothing,
-         reduction,
-         softcap,
-         return_z_loss,
-     )
-     if not return_z_loss:
-         return loss
-     return loss, z_loss