liger-kernel-nightly 0.3.1.dev20241101201851__tar.gz → 0.3.1.dev20241102170757__tar.gz

This diff shows the changes between two publicly released versions of the package as they appear in their public registry. It is provided for informational purposes only.

Potentially problematic release: this version of liger-kernel-nightly might be problematic.

Files changed (58)
  1. {liger_kernel_nightly-0.3.1.dev20241101201851 → liger_kernel_nightly-0.3.1.dev20241102170757}/PKG-INFO +10 -2
  2. {liger_kernel_nightly-0.3.1.dev20241101201851 → liger_kernel_nightly-0.3.1.dev20241102170757}/README.md +9 -1
  3. {liger_kernel_nightly-0.3.1.dev20241101201851 → liger_kernel_nightly-0.3.1.dev20241102170757}/pyproject.toml +1 -1
  4. {liger_kernel_nightly-0.3.1.dev20241101201851 → liger_kernel_nightly-0.3.1.dev20241102170757}/src/liger_kernel/ops/cross_entropy.py +3 -3
  5. {liger_kernel_nightly-0.3.1.dev20241101201851 → liger_kernel_nightly-0.3.1.dev20241102170757}/src/liger_kernel/ops/fused_linear_cross_entropy.py +10 -5
  6. {liger_kernel_nightly-0.3.1.dev20241101201851 → liger_kernel_nightly-0.3.1.dev20241102170757}/src/liger_kernel/ops/fused_linear_jsd.py +8 -3
  7. {liger_kernel_nightly-0.3.1.dev20241101201851 → liger_kernel_nightly-0.3.1.dev20241102170757}/src/liger_kernel/ops/kl_div.py +2 -2
  8. {liger_kernel_nightly-0.3.1.dev20241101201851 → liger_kernel_nightly-0.3.1.dev20241102170757}/src/liger_kernel/ops/utils.py +5 -1
  9. {liger_kernel_nightly-0.3.1.dev20241101201851 → liger_kernel_nightly-0.3.1.dev20241102170757}/src/liger_kernel/transformers/model/llama.py +21 -19
  10. {liger_kernel_nightly-0.3.1.dev20241101201851 → liger_kernel_nightly-0.3.1.dev20241102170757}/src/liger_kernel_nightly.egg-info/PKG-INFO +10 -2
  11. {liger_kernel_nightly-0.3.1.dev20241101201851 → liger_kernel_nightly-0.3.1.dev20241102170757}/LICENSE +0 -0
  12. {liger_kernel_nightly-0.3.1.dev20241101201851 → liger_kernel_nightly-0.3.1.dev20241102170757}/LICENSE-Apache-2.0 +0 -0
  13. {liger_kernel_nightly-0.3.1.dev20241101201851 → liger_kernel_nightly-0.3.1.dev20241102170757}/LICENSE-MIT-AutoAWQ +0 -0
  14. {liger_kernel_nightly-0.3.1.dev20241101201851 → liger_kernel_nightly-0.3.1.dev20241102170757}/LICENSE-MIT-Efficient-Cross-Entropy +0 -0
  15. {liger_kernel_nightly-0.3.1.dev20241101201851 → liger_kernel_nightly-0.3.1.dev20241102170757}/LICENSE-MIT-llmc +0 -0
  16. {liger_kernel_nightly-0.3.1.dev20241101201851 → liger_kernel_nightly-0.3.1.dev20241102170757}/LICENSE-MIT-triton +0 -0
  17. {liger_kernel_nightly-0.3.1.dev20241101201851 → liger_kernel_nightly-0.3.1.dev20241102170757}/NOTICE +0 -0
  18. {liger_kernel_nightly-0.3.1.dev20241101201851 → liger_kernel_nightly-0.3.1.dev20241102170757}/setup.cfg +0 -0
  19. {liger_kernel_nightly-0.3.1.dev20241101201851 → liger_kernel_nightly-0.3.1.dev20241102170757}/src/liger_kernel/env_report.py +0 -0
  20. {liger_kernel_nightly-0.3.1.dev20241101201851 → liger_kernel_nightly-0.3.1.dev20241102170757}/src/liger_kernel/ops/__init__.py +0 -0
  21. {liger_kernel_nightly-0.3.1.dev20241101201851 → liger_kernel_nightly-0.3.1.dev20241102170757}/src/liger_kernel/ops/experimental/embedding.py +0 -0
  22. {liger_kernel_nightly-0.3.1.dev20241101201851 → liger_kernel_nightly-0.3.1.dev20241102170757}/src/liger_kernel/ops/experimental/mm_int8int2.py +0 -0
  23. {liger_kernel_nightly-0.3.1.dev20241101201851 → liger_kernel_nightly-0.3.1.dev20241102170757}/src/liger_kernel/ops/geglu.py +0 -0
  24. {liger_kernel_nightly-0.3.1.dev20241101201851 → liger_kernel_nightly-0.3.1.dev20241102170757}/src/liger_kernel/ops/jsd.py +0 -0
  25. {liger_kernel_nightly-0.3.1.dev20241101201851 → liger_kernel_nightly-0.3.1.dev20241102170757}/src/liger_kernel/ops/layer_norm.py +0 -0
  26. {liger_kernel_nightly-0.3.1.dev20241101201851 → liger_kernel_nightly-0.3.1.dev20241102170757}/src/liger_kernel/ops/rms_norm.py +0 -0
  27. {liger_kernel_nightly-0.3.1.dev20241101201851 → liger_kernel_nightly-0.3.1.dev20241102170757}/src/liger_kernel/ops/rope.py +0 -0
  28. {liger_kernel_nightly-0.3.1.dev20241101201851 → liger_kernel_nightly-0.3.1.dev20241102170757}/src/liger_kernel/ops/swiglu.py +0 -0
  29. {liger_kernel_nightly-0.3.1.dev20241101201851 → liger_kernel_nightly-0.3.1.dev20241102170757}/src/liger_kernel/transformers/__init__.py +0 -0
  30. {liger_kernel_nightly-0.3.1.dev20241101201851 → liger_kernel_nightly-0.3.1.dev20241102170757}/src/liger_kernel/transformers/auto_model.py +0 -0
  31. {liger_kernel_nightly-0.3.1.dev20241101201851 → liger_kernel_nightly-0.3.1.dev20241102170757}/src/liger_kernel/transformers/cross_entropy.py +0 -0
  32. {liger_kernel_nightly-0.3.1.dev20241101201851 → liger_kernel_nightly-0.3.1.dev20241102170757}/src/liger_kernel/transformers/experimental/embedding.py +0 -0
  33. {liger_kernel_nightly-0.3.1.dev20241101201851 → liger_kernel_nightly-0.3.1.dev20241102170757}/src/liger_kernel/transformers/functional.py +0 -0
  34. {liger_kernel_nightly-0.3.1.dev20241101201851 → liger_kernel_nightly-0.3.1.dev20241102170757}/src/liger_kernel/transformers/fused_linear_cross_entropy.py +0 -0
  35. {liger_kernel_nightly-0.3.1.dev20241101201851 → liger_kernel_nightly-0.3.1.dev20241102170757}/src/liger_kernel/transformers/fused_linear_jsd.py +0 -0
  36. {liger_kernel_nightly-0.3.1.dev20241101201851 → liger_kernel_nightly-0.3.1.dev20241102170757}/src/liger_kernel/transformers/geglu.py +0 -0
  37. {liger_kernel_nightly-0.3.1.dev20241101201851 → liger_kernel_nightly-0.3.1.dev20241102170757}/src/liger_kernel/transformers/jsd.py +0 -0
  38. {liger_kernel_nightly-0.3.1.dev20241101201851 → liger_kernel_nightly-0.3.1.dev20241102170757}/src/liger_kernel/transformers/kl_div.py +0 -0
  39. {liger_kernel_nightly-0.3.1.dev20241101201851 → liger_kernel_nightly-0.3.1.dev20241102170757}/src/liger_kernel/transformers/layer_norm.py +0 -0
  40. {liger_kernel_nightly-0.3.1.dev20241101201851 → liger_kernel_nightly-0.3.1.dev20241102170757}/src/liger_kernel/transformers/model/__init__.py +0 -0
  41. {liger_kernel_nightly-0.3.1.dev20241101201851 → liger_kernel_nightly-0.3.1.dev20241102170757}/src/liger_kernel/transformers/model/gemma.py +0 -0
  42. {liger_kernel_nightly-0.3.1.dev20241101201851 → liger_kernel_nightly-0.3.1.dev20241102170757}/src/liger_kernel/transformers/model/mistral.py +0 -0
  43. {liger_kernel_nightly-0.3.1.dev20241101201851 → liger_kernel_nightly-0.3.1.dev20241102170757}/src/liger_kernel/transformers/model/mixtral.py +0 -0
  44. {liger_kernel_nightly-0.3.1.dev20241101201851 → liger_kernel_nightly-0.3.1.dev20241102170757}/src/liger_kernel/transformers/model/mllama.py +0 -0
  45. {liger_kernel_nightly-0.3.1.dev20241101201851 → liger_kernel_nightly-0.3.1.dev20241102170757}/src/liger_kernel/transformers/model/phi3.py +0 -0
  46. {liger_kernel_nightly-0.3.1.dev20241101201851 → liger_kernel_nightly-0.3.1.dev20241102170757}/src/liger_kernel/transformers/model/qwen2.py +0 -0
  47. {liger_kernel_nightly-0.3.1.dev20241101201851 → liger_kernel_nightly-0.3.1.dev20241102170757}/src/liger_kernel/transformers/model/qwen2_vl.py +0 -0
  48. {liger_kernel_nightly-0.3.1.dev20241101201851 → liger_kernel_nightly-0.3.1.dev20241102170757}/src/liger_kernel/transformers/monkey_patch.py +0 -0
  49. {liger_kernel_nightly-0.3.1.dev20241101201851 → liger_kernel_nightly-0.3.1.dev20241102170757}/src/liger_kernel/transformers/rms_norm.py +0 -0
  50. {liger_kernel_nightly-0.3.1.dev20241101201851 → liger_kernel_nightly-0.3.1.dev20241102170757}/src/liger_kernel/transformers/rope.py +0 -0
  51. {liger_kernel_nightly-0.3.1.dev20241101201851 → liger_kernel_nightly-0.3.1.dev20241102170757}/src/liger_kernel/transformers/swiglu.py +0 -0
  52. {liger_kernel_nightly-0.3.1.dev20241101201851 → liger_kernel_nightly-0.3.1.dev20241102170757}/src/liger_kernel/transformers/trainer_integration.py +0 -0
  53. {liger_kernel_nightly-0.3.1.dev20241101201851 → liger_kernel_nightly-0.3.1.dev20241102170757}/src/liger_kernel/triton/__init__.py +0 -0
  54. {liger_kernel_nightly-0.3.1.dev20241101201851 → liger_kernel_nightly-0.3.1.dev20241102170757}/src/liger_kernel/triton/monkey_patch.py +0 -0
  55. {liger_kernel_nightly-0.3.1.dev20241101201851 → liger_kernel_nightly-0.3.1.dev20241102170757}/src/liger_kernel_nightly.egg-info/SOURCES.txt +0 -0
  56. {liger_kernel_nightly-0.3.1.dev20241101201851 → liger_kernel_nightly-0.3.1.dev20241102170757}/src/liger_kernel_nightly.egg-info/dependency_links.txt +0 -0
  57. {liger_kernel_nightly-0.3.1.dev20241101201851 → liger_kernel_nightly-0.3.1.dev20241102170757}/src/liger_kernel_nightly.egg-info/requires.txt +0 -0
  58. {liger_kernel_nightly-0.3.1.dev20241101201851 → liger_kernel_nightly-0.3.1.dev20241102170757}/src/liger_kernel_nightly.egg-info/top_level.txt +0 -0
--- a/PKG-INFO
+++ b/PKG-INFO
@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: liger_kernel_nightly
-Version: 0.3.1.dev20241101201851
+Version: 0.3.1.dev20241102170757
 Summary: Efficient Triton kernels for LLM Training
 License: BSD 2-CLAUSE LICENSE
 Copyright 2024 LinkedIn Corporation
@@ -163,11 +163,18 @@ With one line of code, Liger Kernel can increase throughput by more than 20% and
 
 ## Installation
 
-### Dependencies
+### Dependencies
+
+#### CUDA
 
 - `torch >= 2.1.2`
 - `triton >= 2.3.0`
 
+#### ROCm
+
+- `torch >= 2.5.0` Install according to the instruction in Pytorch official webpage.
+- `triton >= 3.0.0` Install from pypi. (e.g. `pip install triton==3.0.0`)
+
 ### Optional Dependencies
 
 - `transformers >= 4.x`: Required if you plan to use the transformers models patching APIs. The specific model you are working will dictate the minimum version of transformers.
@@ -197,6 +204,7 @@ pip install -e .
 pip install -e .[transformers]
 ```
 
+
 ## Getting Started
 
 There are a couple of ways to apply Liger kernels, depending on the level of customization required.
--- a/README.md
+++ b/README.md
@@ -111,11 +111,18 @@ With one line of code, Liger Kernel can increase throughput by more than 20% and
 
 ## Installation
 
-### Dependencies
+### Dependencies
+
+#### CUDA
 
 - `torch >= 2.1.2`
 - `triton >= 2.3.0`
 
+#### ROCm
+
+- `torch >= 2.5.0` Install according to the instruction in Pytorch official webpage.
+- `triton >= 3.0.0` Install from pypi. (e.g. `pip install triton==3.0.0`)
+
 ### Optional Dependencies
 
 - `transformers >= 4.x`: Required if you plan to use the transformers models patching APIs. The specific model you are working will dictate the minimum version of transformers.
@@ -145,6 +152,7 @@ pip install -e .
 pip install -e .[transformers]
 ```
 
+
 ## Getting Started
 
 There are a couple of ways to apply Liger kernels, depending on the level of customization required.
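The new ROCm requirements above can be verified at runtime. A minimal pre-flight sketch (not part of the package), reusing the packaging.version helper the repo already imports in ops/utils.py; the "+rocmX.Y" suffix handling assumes the usual ROCm wheel version format:

import torch
import triton
from packaging.version import Version

if torch.version.hip is not None:  # ROCm/HIP build of PyTorch
    torch_version = Version(torch.__version__.split("+")[0])  # drop e.g. "+rocm6.1"
    assert torch_version >= Version("2.5.0"), "ROCm support needs torch >= 2.5.0"
    assert Version(triton.__version__) >= Version("3.0.0"), "ROCm support needs triton >= 3.0.0"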
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 
 [project]
 name = "liger_kernel_nightly"
-version = "0.3.1.dev20241101201851"
+version = "0.3.1.dev20241102170757"
 description = "Efficient Triton kernels for LLM Training"
 urls = { "Homepage" = "https://github.com/linkedin/Liger-Kernel" }
 readme = { file = "README.md", content-type = "text/markdown" }
--- a/src/liger_kernel/ops/cross_entropy.py
+++ b/src/liger_kernel/ops/cross_entropy.py
@@ -2,7 +2,7 @@ import torch
 import triton
 import triton.language as tl
 
-from liger_kernel.ops.utils import element_mul_kernel
+from liger_kernel.ops.utils import element_mul_kernel, is_hip
 
 
 @triton.jit
@@ -194,7 +194,7 @@ def cross_entropy_forward(_input, target, ignore_index, label_smoothing, reduction
         BLOCK_SIZE=BLOCK_SIZE,
         # TODO: 32 seems to give the best performance
         # Performance is quite sensitive to num_warps
-        num_warps=32,
+        num_warps=32 if not is_hip() else 16,
     )
 
     loss = torch.sum(loss_1d)
@@ -219,7 +219,7 @@ def cross_entropy_backward(_input, grad_output):
         grad_output,
         V,
         BLOCK_SIZE=BLOCK_SIZE,
-        num_warps=32,
+        num_warps=32 if not is_hip() else 16,
     )
 
     return _input
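The change repeated throughout this release is the num_warps dispatch. A plausible reading, not stated in the diff itself: Triton treats a "warp" on AMD hardware as a 64-lane wavefront, so 32 warps would request 2048 threads per block, above the usual 1024-thread cap, while 16 wavefronts keep the launch at exactly 1024 threads. A minimal sketch of the pattern:

import torch

def is_hip() -> bool:
    # Same helper this release adds to src/liger_kernel/ops/utils.py:
    # torch.version.hip is only set on ROCm builds of PyTorch.
    return torch.version.hip is not None

# CUDA: 32 warps * 32 lanes = 1024 threads per block.
# HIP:  16 "warps" (64-lane wavefronts) * 64 lanes = 1024 threads per block.
num_warps = 32 if not is_hip() else 16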
--- a/src/liger_kernel/ops/fused_linear_cross_entropy.py
+++ b/src/liger_kernel/ops/fused_linear_cross_entropy.py
@@ -2,7 +2,12 @@ import torch
 import triton
 
 from liger_kernel.ops.cross_entropy import liger_cross_entropy_kernel
-from liger_kernel.ops.utils import amp_custom_bwd, amp_custom_fwd, element_mul_kernel
+from liger_kernel.ops.utils import (
+    amp_custom_bwd,
+    amp_custom_fwd,
+    element_mul_kernel,
+    is_hip,
+)
 
 # The hard limit of TRITON_MAX_TENSOR_NUMEL is 1048576 https://github.com/triton-lang/triton/blob/ba42a5c68fd0505f8c42f4202d53be0f8d9a5fe0/python/triton/language/core.py#L19
 # However, setting limit as 65536 as in LayerNorm tutorial is faster because of less register spilling
@@ -88,7 +93,7 @@ def fused_linear_cross_entropy_forward(
            label_smoothing=label_smoothing,
            reduction=reduction,
            BLOCK_SIZE=BLOCK_SIZE,
-            num_warps=32,
+            num_warps=32 if not is_hip() else 16,
        )
 
        # gradient of logits_chunk is computed in-place by the above triton kernel.
@@ -153,7 +158,7 @@ def fused_linear_cross_entropy_backward(
            grad_output,
            H,
            BLOCK_SIZE=BLOCK_SIZE,
-            num_warps=32,
+            num_warps=32 if not is_hip() else 16,
        )
 
        # handle grad_weight
@@ -167,7 +172,7 @@ def fused_linear_cross_entropy_backward(
            grad_output,
            H,
            BLOCK_SIZE=BLOCK_SIZE,
-            num_warps=32,
+            num_warps=32 if not is_hip() else 16,
        )
 
        if grad_bias is not None:
@@ -180,7 +185,7 @@ def fused_linear_cross_entropy_backward(
            grad_output,
            1,
            BLOCK_SIZE=BLOCK_SIZE,
-            num_warps=32,
+            num_warps=32 if not is_hip() else 16,
        )
    return grad_input, grad_weight, grad_bias
 
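The comment carried through the import hunk above explains the sizing constraint: Triton's hard tensor-size limit is 1048576 elements, but capping fused blocks at 65536 reduces register spilling. A sketch of how a block size is typically derived under that cap (MAX_FUSED_SIZE and pick_block_size are illustrative names, not the file's actual code):

import triton

MAX_FUSED_SIZE = 65536  # stays well under Triton's hard limit of 1048576 elements

def pick_block_size(vocab_size: int) -> int:
    # Round the vocabulary size up to a power of two, then cap it to
    # limit register spilling, per the comment in the file.
    return min(MAX_FUSED_SIZE, triton.next_power_of_2(vocab_size))

print(pick_block_size(32000))   # 32768
print(pick_block_size(128256))  # 65536 (capped)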
--- a/src/liger_kernel/ops/fused_linear_jsd.py
+++ b/src/liger_kernel/ops/fused_linear_jsd.py
@@ -4,7 +4,12 @@ import torch
 import triton
 
 from liger_kernel.ops.jsd import _jsd_kernel
-from liger_kernel.ops.utils import amp_custom_bwd, amp_custom_fwd, element_mul_kernel
+from liger_kernel.ops.utils import (
+    amp_custom_bwd,
+    amp_custom_fwd,
+    element_mul_kernel,
+    is_hip,
+)
 
 # The hard limit of TRITON_MAX_TENSOR_NUMEL is 1048576 https://github.com/triton-lang/triton/blob/ba42a5c68fd0505f8c42f4202d53be0f8d9a5fe0/python/triton/language/core.py#L19
 # However, setting limit as 65536 as in LayerNorm tutorial is faster because of less register spilling
@@ -147,7 +152,7 @@ def fused_linear_jsd_backward(grad_output, grad_input, grad_weight):
            grad_output,
            H,
            BLOCK_SIZE=BLOCK_SIZE,
-            num_warps=32,
+            num_warps=32 if not is_hip() else 16,
        )
 
        # handle grad_weight
@@ -161,7 +166,7 @@ def fused_linear_jsd_backward(grad_output, grad_input, grad_weight):
            grad_output,
            H,
            BLOCK_SIZE=BLOCK_SIZE,
-            num_warps=32,
+            num_warps=32 if not is_hip() else 16,
        )
 
    return grad_input, grad_weight
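Both fused files import amp_custom_fwd/amp_custom_bwd from ops/utils alongside the new is_hip, but their definitions sit outside this diff. One common way such wrappers are written, assuming they bridge the torch >= 2.4 move of the AMP decorators from torch.cuda.amp to torch.amp (a hedged sketch, not the file's actual code):

import functools

# torch >= 2.4 exposes the AMP decorators as torch.amp.custom_fwd/custom_bwd
# with an explicit device_type; older releases keep them in torch.cuda.amp.
try:
    from torch.amp import custom_bwd, custom_fwd  # torch >= 2.4
    amp_custom_fwd = functools.partial(custom_fwd, device_type="cuda")
    amp_custom_bwd = functools.partial(custom_bwd, device_type="cuda")
except ImportError:
    from torch.cuda.amp import custom_bwd as amp_custom_bwd  # older torch
    from torch.cuda.amp import custom_fwd as amp_custom_fwd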
--- a/src/liger_kernel/ops/kl_div.py
+++ b/src/liger_kernel/ops/kl_div.py
@@ -4,13 +4,13 @@ import torch
 import triton
 import triton.language as tl
 
-from liger_kernel.ops.utils import ensure_contiguous
+from liger_kernel.ops.utils import ensure_contiguous, is_hip
 
 
 def get_num_warps(BLOCK_SIZE):
     num_warps = 4
     if BLOCK_SIZE >= 32768:
-        num_warps = 32
+        num_warps = 32 if not is_hip() else 16
     elif BLOCK_SIZE >= 8192:
         num_warps = 16
     elif BLOCK_SIZE >= 2048:
--- a/src/liger_kernel/ops/utils.py
+++ b/src/liger_kernel/ops/utils.py
@@ -21,6 +21,10 @@ import triton.language as tl
 from packaging.version import Version
 
 
+def is_hip() -> bool:
+    return torch.version.hip is not None
+
+
 def ensure_contiguous(fn):
     @functools.wraps(fn)
     def wrapper(ctx, *args, **kwargs):
@@ -47,7 +51,7 @@ def calculate_settings(n):
 
     num_warps = 4
     if BLOCK_SIZE >= 32768:
-        num_warps = 32
+        num_warps = 32 if not is_hip() else 16
     elif BLOCK_SIZE >= 8192:
         num_warps = 16
     elif BLOCK_SIZE >= 2048:
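The new is_hip helper keys off torch.version.hip, which PyTorch populates only in ROCm builds. A quick way to see what a given build reports:

import torch

# On a ROCm wheel torch.version.hip is a version string (e.g. "6.1.x")
# and torch.version.cuda is None; on a CUDA wheel the reverse holds.
print("hip build: ", torch.version.hip)
print("cuda build:", torch.version.cuda)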
--- a/src/liger_kernel/transformers/model/llama.py
+++ b/src/liger_kernel/transformers/model/llama.py
@@ -1,4 +1,4 @@
-from typing import List, Optional, Tuple, Union
+from typing import TYPE_CHECKING, List, Optional, Tuple, Union
 
 import torch
 import torch.nn.functional as F
@@ -17,6 +17,9 @@ from liger_kernel.transformers.fused_linear_cross_entropy import (
     LigerFusedLinearCrossEntropyLoss,
 )
 
+if TYPE_CHECKING:
+    from transformers.cache_utils import Cache
+
 
 @add_start_docstrings_to_model_forward(LLAMA_INPUTS_DOCSTRING)
 @replace_return_docstrings(
@@ -27,7 +30,7 @@ def lce_forward_deprecated(
     input_ids: torch.LongTensor = None,
     attention_mask: Optional[torch.Tensor] = None,
     position_ids: Optional[torch.LongTensor] = None,
-    past_key_values: Optional[List[torch.FloatTensor]] = None,
+    past_key_values: Optional[Union["Cache", List[torch.FloatTensor]]] = None,
     inputs_embeds: Optional[torch.FloatTensor] = None,
     labels: Optional[torch.LongTensor] = None,
     use_cache: Optional[bool] = None,
@@ -153,19 +156,19 @@ def lce_forward_deprecated(
 )
 def lce_forward(
     self,
-    input_ids=None,
-    attention_mask=None,
-    position_ids=None,
-    past_key_values=None,
-    inputs_embeds=None,
-    labels=None,
-    use_cache=None,
-    output_attentions=None,
-    output_hidden_states=None,
-    return_dict=None,
-    cache_position=None,
-    num_logits_to_keep=0,
-    **kwargs,
+    input_ids: torch.LongTensor = None,
+    attention_mask: Optional[torch.Tensor] = None,
+    position_ids: Optional[torch.LongTensor] = None,
+    past_key_values: Optional[Union["Cache", List[torch.FloatTensor]]] = None,
+    inputs_embeds: Optional[torch.FloatTensor] = None,
+    labels: Optional[torch.LongTensor] = None,
+    use_cache: Optional[bool] = None,
+    output_attentions: Optional[bool] = None,
+    output_hidden_states: Optional[bool] = None,
+    return_dict: Optional[bool] = None,
+    cache_position: Optional[torch.LongTensor] = None,
+    num_logits_to_keep: int = 0,
+    **loss_kwargs,
 ) -> Union[Tuple, CausalLMOutputWithPast]:
     r"""
     Args:
@@ -224,7 +227,6 @@ def lce_forward(
         output_hidden_states=output_hidden_states,
         return_dict=return_dict,
         cache_position=cache_position,
-        **kwargs,
     )
 
     hidden_states = outputs[0]
@@ -245,12 +247,12 @@ def lce_forward(
         shift_hidden_states = shift_hidden_states.view(-1, self.config.hidden_size)
         shift_labels = shift_labels.view(-1)
 
-        reduction = "sum" if "num_items_in_batch" in kwargs else "mean"
+        reduction = "sum" if "num_items_in_batch" in loss_kwargs else "mean"
         lce = LigerFusedLinearCrossEntropyLoss(reduction=reduction)
 
         loss = lce(self.lm_head.weight, shift_hidden_states, shift_labels)
         if reduction == "sum":
-            loss /= kwargs["num_items_in_batch"]
+            loss /= loss_kwargs["num_items_in_batch"]
 
     else:  # if in inference mode materialize logits
         logits = self.lm_head(hidden_states[:, -num_logits_to_keep:, :])
@@ -259,7 +261,7 @@ def lce_forward(
             logits=logits,
             labels=labels,
             vocab_size=self.config.vocab_size,
-            **kwargs,
+            **loss_kwargs,
         )
 
     if not return_dict:
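The **kwargs to **loss_kwargs rename tracks upstream transformers: when the Trainer supplies num_items_in_batch (its gradient-accumulation loss fix), the patched forward sums token losses and divides by the global token count rather than taking a per-micro-batch mean. A standalone sketch of that switch, using plain F.cross_entropy as a stand-in for the fused Liger loss (normalized_lm_loss is an illustrative name):

import torch
import torch.nn.functional as F

def normalized_lm_loss(logits: torch.Tensor, labels: torch.Tensor, **loss_kwargs):
    # Mirrors the reduction switch in lce_forward above.
    reduction = "sum" if "num_items_in_batch" in loss_kwargs else "mean"
    loss = F.cross_entropy(
        logits.view(-1, logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,
        reduction=reduction,
    )
    if reduction == "sum":
        # Normalize by tokens across all accumulated micro-batches,
        # not just this one.
        loss = loss / loss_kwargs["num_items_in_batch"]
    return loss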
--- a/src/liger_kernel_nightly.egg-info/PKG-INFO
+++ b/src/liger_kernel_nightly.egg-info/PKG-INFO
@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: liger_kernel_nightly
-Version: 0.3.1.dev20241101201851
+Version: 0.3.1.dev20241102170757
 Summary: Efficient Triton kernels for LLM Training
 License: BSD 2-CLAUSE LICENSE
 Copyright 2024 LinkedIn Corporation
@@ -163,11 +163,18 @@ With one line of code, Liger Kernel can increase throughput by more than 20% and
 
 ## Installation
 
-### Dependencies
+### Dependencies
+
+#### CUDA
 
 - `torch >= 2.1.2`
 - `triton >= 2.3.0`
 
+#### ROCm
+
+- `torch >= 2.5.0` Install according to the instruction in Pytorch official webpage.
+- `triton >= 3.0.0` Install from pypi. (e.g. `pip install triton==3.0.0`)
+
 ### Optional Dependencies
 
 - `transformers >= 4.x`: Required if you plan to use the transformers models patching APIs. The specific model you are working will dictate the minimum version of transformers.
@@ -197,6 +204,7 @@ pip install -e .
 pip install -e .[transformers]
 ```
 
+
 ## Getting Started
 
 There are a couple of ways to apply Liger kernels, depending on the level of customization required.