liger-kernel-nightly 0.4.0.dev20241107202516__tar.gz → 0.4.0.dev20241108173943__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Potentially problematic release: this version of liger-kernel-nightly might be problematic.
- {liger_kernel_nightly-0.4.0.dev20241107202516/src/liger_kernel_nightly.egg-info → liger_kernel_nightly-0.4.0.dev20241108173943}/PKG-INFO +3 -2
- {liger_kernel_nightly-0.4.0.dev20241107202516 → liger_kernel_nightly-0.4.0.dev20241108173943}/README.md +2 -1
- {liger_kernel_nightly-0.4.0.dev20241107202516 → liger_kernel_nightly-0.4.0.dev20241108173943}/pyproject.toml +1 -1
- {liger_kernel_nightly-0.4.0.dev20241107202516 → liger_kernel_nightly-0.4.0.dev20241108173943/src/liger_kernel_nightly.egg-info}/PKG-INFO +3 -2
- {liger_kernel_nightly-0.4.0.dev20241107202516 → liger_kernel_nightly-0.4.0.dev20241108173943}/LICENSE +0 -0
- {liger_kernel_nightly-0.4.0.dev20241107202516 → liger_kernel_nightly-0.4.0.dev20241108173943}/NOTICE +0 -0
- {liger_kernel_nightly-0.4.0.dev20241107202516 → liger_kernel_nightly-0.4.0.dev20241108173943}/setup.cfg +0 -0
- {liger_kernel_nightly-0.4.0.dev20241107202516 → liger_kernel_nightly-0.4.0.dev20241108173943}/src/liger_kernel/env_report.py +0 -0
- {liger_kernel_nightly-0.4.0.dev20241107202516 → liger_kernel_nightly-0.4.0.dev20241108173943}/src/liger_kernel/ops/__init__.py +0 -0
- {liger_kernel_nightly-0.4.0.dev20241107202516 → liger_kernel_nightly-0.4.0.dev20241108173943}/src/liger_kernel/ops/cross_entropy.py +0 -0
- {liger_kernel_nightly-0.4.0.dev20241107202516 → liger_kernel_nightly-0.4.0.dev20241108173943}/src/liger_kernel/ops/experimental/embedding.py +0 -0
- {liger_kernel_nightly-0.4.0.dev20241107202516 → liger_kernel_nightly-0.4.0.dev20241108173943}/src/liger_kernel/ops/experimental/mm_int8int2.py +0 -0
- {liger_kernel_nightly-0.4.0.dev20241107202516 → liger_kernel_nightly-0.4.0.dev20241108173943}/src/liger_kernel/ops/fused_linear_cross_entropy.py +0 -0
- {liger_kernel_nightly-0.4.0.dev20241107202516 → liger_kernel_nightly-0.4.0.dev20241108173943}/src/liger_kernel/ops/fused_linear_jsd.py +0 -0
- {liger_kernel_nightly-0.4.0.dev20241107202516 → liger_kernel_nightly-0.4.0.dev20241108173943}/src/liger_kernel/ops/geglu.py +0 -0
- {liger_kernel_nightly-0.4.0.dev20241107202516 → liger_kernel_nightly-0.4.0.dev20241108173943}/src/liger_kernel/ops/group_norm.py +0 -0
- {liger_kernel_nightly-0.4.0.dev20241107202516 → liger_kernel_nightly-0.4.0.dev20241108173943}/src/liger_kernel/ops/jsd.py +0 -0
- {liger_kernel_nightly-0.4.0.dev20241107202516 → liger_kernel_nightly-0.4.0.dev20241108173943}/src/liger_kernel/ops/kl_div.py +0 -0
- {liger_kernel_nightly-0.4.0.dev20241107202516 → liger_kernel_nightly-0.4.0.dev20241108173943}/src/liger_kernel/ops/layer_norm.py +0 -0
- {liger_kernel_nightly-0.4.0.dev20241107202516 → liger_kernel_nightly-0.4.0.dev20241108173943}/src/liger_kernel/ops/rms_norm.py +0 -0
- {liger_kernel_nightly-0.4.0.dev20241107202516 → liger_kernel_nightly-0.4.0.dev20241108173943}/src/liger_kernel/ops/rope.py +0 -0
- {liger_kernel_nightly-0.4.0.dev20241107202516 → liger_kernel_nightly-0.4.0.dev20241108173943}/src/liger_kernel/ops/swiglu.py +0 -0
- {liger_kernel_nightly-0.4.0.dev20241107202516 → liger_kernel_nightly-0.4.0.dev20241108173943}/src/liger_kernel/ops/utils.py +0 -0
- {liger_kernel_nightly-0.4.0.dev20241107202516 → liger_kernel_nightly-0.4.0.dev20241108173943}/src/liger_kernel/transformers/__init__.py +0 -0
- {liger_kernel_nightly-0.4.0.dev20241107202516 → liger_kernel_nightly-0.4.0.dev20241108173943}/src/liger_kernel/transformers/auto_model.py +0 -0
- {liger_kernel_nightly-0.4.0.dev20241107202516 → liger_kernel_nightly-0.4.0.dev20241108173943}/src/liger_kernel/transformers/cross_entropy.py +0 -0
- {liger_kernel_nightly-0.4.0.dev20241107202516 → liger_kernel_nightly-0.4.0.dev20241108173943}/src/liger_kernel/transformers/experimental/embedding.py +0 -0
- {liger_kernel_nightly-0.4.0.dev20241107202516 → liger_kernel_nightly-0.4.0.dev20241108173943}/src/liger_kernel/transformers/functional.py +0 -0
- {liger_kernel_nightly-0.4.0.dev20241107202516 → liger_kernel_nightly-0.4.0.dev20241108173943}/src/liger_kernel/transformers/fused_linear_cross_entropy.py +0 -0
- {liger_kernel_nightly-0.4.0.dev20241107202516 → liger_kernel_nightly-0.4.0.dev20241108173943}/src/liger_kernel/transformers/fused_linear_jsd.py +0 -0
- {liger_kernel_nightly-0.4.0.dev20241107202516 → liger_kernel_nightly-0.4.0.dev20241108173943}/src/liger_kernel/transformers/geglu.py +0 -0
- {liger_kernel_nightly-0.4.0.dev20241107202516 → liger_kernel_nightly-0.4.0.dev20241108173943}/src/liger_kernel/transformers/group_norm.py +0 -0
- {liger_kernel_nightly-0.4.0.dev20241107202516 → liger_kernel_nightly-0.4.0.dev20241108173943}/src/liger_kernel/transformers/jsd.py +0 -0
- {liger_kernel_nightly-0.4.0.dev20241107202516 → liger_kernel_nightly-0.4.0.dev20241108173943}/src/liger_kernel/transformers/kl_div.py +0 -0
- {liger_kernel_nightly-0.4.0.dev20241107202516 → liger_kernel_nightly-0.4.0.dev20241108173943}/src/liger_kernel/transformers/layer_norm.py +0 -0
- {liger_kernel_nightly-0.4.0.dev20241107202516 → liger_kernel_nightly-0.4.0.dev20241108173943}/src/liger_kernel/transformers/model/__init__.py +0 -0
- {liger_kernel_nightly-0.4.0.dev20241107202516 → liger_kernel_nightly-0.4.0.dev20241108173943}/src/liger_kernel/transformers/model/gemma.py +0 -0
- {liger_kernel_nightly-0.4.0.dev20241107202516 → liger_kernel_nightly-0.4.0.dev20241108173943}/src/liger_kernel/transformers/model/llama.py +0 -0
- {liger_kernel_nightly-0.4.0.dev20241107202516 → liger_kernel_nightly-0.4.0.dev20241108173943}/src/liger_kernel/transformers/model/mistral.py +0 -0
- {liger_kernel_nightly-0.4.0.dev20241107202516 → liger_kernel_nightly-0.4.0.dev20241108173943}/src/liger_kernel/transformers/model/mixtral.py +0 -0
- {liger_kernel_nightly-0.4.0.dev20241107202516 → liger_kernel_nightly-0.4.0.dev20241108173943}/src/liger_kernel/transformers/model/mllama.py +0 -0
- {liger_kernel_nightly-0.4.0.dev20241107202516 → liger_kernel_nightly-0.4.0.dev20241108173943}/src/liger_kernel/transformers/model/phi3.py +0 -0
- {liger_kernel_nightly-0.4.0.dev20241107202516 → liger_kernel_nightly-0.4.0.dev20241108173943}/src/liger_kernel/transformers/model/qwen2.py +0 -0
- {liger_kernel_nightly-0.4.0.dev20241107202516 → liger_kernel_nightly-0.4.0.dev20241108173943}/src/liger_kernel/transformers/model/qwen2_vl.py +0 -0
- {liger_kernel_nightly-0.4.0.dev20241107202516 → liger_kernel_nightly-0.4.0.dev20241108173943}/src/liger_kernel/transformers/monkey_patch.py +0 -0
- {liger_kernel_nightly-0.4.0.dev20241107202516 → liger_kernel_nightly-0.4.0.dev20241108173943}/src/liger_kernel/transformers/rms_norm.py +0 -0
- {liger_kernel_nightly-0.4.0.dev20241107202516 → liger_kernel_nightly-0.4.0.dev20241108173943}/src/liger_kernel/transformers/rope.py +0 -0
- {liger_kernel_nightly-0.4.0.dev20241107202516 → liger_kernel_nightly-0.4.0.dev20241108173943}/src/liger_kernel/transformers/swiglu.py +0 -0
- {liger_kernel_nightly-0.4.0.dev20241107202516 → liger_kernel_nightly-0.4.0.dev20241108173943}/src/liger_kernel/transformers/trainer_integration.py +0 -0
- {liger_kernel_nightly-0.4.0.dev20241107202516 → liger_kernel_nightly-0.4.0.dev20241108173943}/src/liger_kernel/triton/__init__.py +0 -0
- {liger_kernel_nightly-0.4.0.dev20241107202516 → liger_kernel_nightly-0.4.0.dev20241108173943}/src/liger_kernel/triton/monkey_patch.py +0 -0
- {liger_kernel_nightly-0.4.0.dev20241107202516 → liger_kernel_nightly-0.4.0.dev20241108173943}/src/liger_kernel_nightly.egg-info/SOURCES.txt +0 -0
- {liger_kernel_nightly-0.4.0.dev20241107202516 → liger_kernel_nightly-0.4.0.dev20241108173943}/src/liger_kernel_nightly.egg-info/dependency_links.txt +0 -0
- {liger_kernel_nightly-0.4.0.dev20241107202516 → liger_kernel_nightly-0.4.0.dev20241108173943}/src/liger_kernel_nightly.egg-info/requires.txt +0 -0
- {liger_kernel_nightly-0.4.0.dev20241107202516 → liger_kernel_nightly-0.4.0.dev20241108173943}/src/liger_kernel_nightly.egg-info/top_level.txt +0 -0
{liger_kernel_nightly-0.4.0.dev20241107202516/src/liger_kernel_nightly.egg-info → liger_kernel_nightly-0.4.0.dev20241108173943}/PKG-INFO

@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: liger_kernel_nightly
-Version: 0.4.0.dev20241107202516
+Version: 0.4.0.dev20241108173943
 Summary: Efficient Triton kernels for LLM Training
 License: BSD 2-CLAUSE LICENSE
 Copyright 2024 LinkedIn Corporation
@@ -320,6 +320,7 @@ loss.backward()
 
 - **RMSNorm**: [RMSNorm](https://arxiv.org/pdf/1910.07467), which normalizes activations using their root mean square, is implemented by fusing the normalization and scaling steps into a single Triton kernel, and achieves ~3X speedup with ~3X peak memory reduction.
 - **LayerNorm**: [LayerNorm](https://arxiv.org/pdf/1607.06450), which centers and normalizes activations across the feature dimension, is implemented by fusing the centering, normalization and scaling steps into a single Triton kernel, and achieves ~2X speedup.
+- **GroupNorm**: [GroupNorm](https://arxiv.org/pdf/1803.08494), which normalizes activations across the group dimension for a given sample. Channels are grouped in K groups over which the normalization is performed, is implemented by fusing the centering, normalization and scaling steps into a single Triton kernel, and can achieve up to ~2X speedup as the number of channels/groups increases.
 - **RoPE**: [Rotary Positional Embedding](https://arxiv.org/pdf/2104.09864) is implemented by fusing the query and key embedding rotary into a single kernel with inplace replacement, and achieves ~3X speedup with ~3X peak memory reduction.
 - **SwiGLU**: [Swish Gated Linear Units](https://arxiv.org/pdf/2002.05202), given by
 $$\text{SwiGLU}(x)=\text{Swish}_{\beta}(xW+b)\otimes(xV+c)$$
@@ -332,7 +333,7 @@ $$\text{GeGLU}(x)=\text{GELU}(xW+b)\otimes(xV+c)$$
 - **FusedLinearCrossEntropy**: Peak memory usage of cross entropy loss is further improved by fusing the model head with the CE loss and chunking the input for block-wise loss and gradient calculation, a technique inspired by [Efficient Cross Entropy](https://github.com/mgmalek/efficient_cross_entropy). It achieves >4X memory reduction for 128k vocab size. **This is highly effective for large batch size, large sequence length, and large vocabulary sizes.** Please refer to the [Medusa example](https://github.com/linkedin/Liger-Kernel/tree/main/examples/medusa) for individual kernel usage.
 - **KLDivergence**: [KL Divergence](https://pytorch.org/docs/stable/generated/torch.nn.KLDivLoss.html) is implemented by fusing the forward into a single triton kernel, with reduction done outside the kernel. It achieves ~1.5X speed and ~15% memory reduction for 128K vocab size.
 - **JSD**: [Generalized JSD](https://arxiv.org/pdf/2306.13649) (Jensen-Shannon divergence), is implemented by computing both the loss and gradient in the forward pass. It achieves ~1.5X speed and ~54% memory reduction for 128k vocab size.
-- **FusedLinearJSD**: Peak memory usage of JSD loss is further improved by fusing the model head with the
+- **FusedLinearJSD**: Peak memory usage of JSD loss is further improved by fusing the model head with the JSD and chunking the input for block-wise loss and gradient calculation. It achieves ~85% memory reduction for 128k vocab size where batch size $\times$ sequence length is 8192.
 
 
 ### Experimental Kernels
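For readers who want to see concretely what the newly documented GroupNorm kernel computes, here is a minimal, unfused PyTorch reference of group normalization. This is only an illustration of the operation, not Liger's Triton kernel or its Python API; the function name and shapes are chosen for the example.

```python
import torch

def group_norm_reference(x: torch.Tensor, num_groups: int,
                         weight: torch.Tensor, bias: torch.Tensor,
                         eps: float = 1e-6) -> torch.Tensor:
    """Unfused reference of the centering, normalization and scaling steps
    that a fused GroupNorm kernel performs in a single pass.

    x: (batch, channels, *spatial); channels are split into `num_groups`
    groups and mean/variance are computed per (sample, group).
    """
    n, c = x.shape[0], x.shape[1]
    g = x.reshape(n, num_groups, -1)                 # flatten each group
    mean = g.mean(dim=-1, keepdim=True)              # center per (sample, group)
    var = g.var(dim=-1, unbiased=False, keepdim=True)
    g = (g - mean) / torch.sqrt(var + eps)           # normalize
    out = g.reshape_as(x)
    affine_shape = (1, c) + (1,) * (x.dim() - 2)     # per-channel scale/shift
    return out * weight.view(affine_shape) + bias.view(affine_shape)

# Sanity check against PyTorch's own group_norm.
x = torch.randn(2, 8, 16)
w, b = torch.ones(8), torch.zeros(8)
expected = torch.nn.functional.group_norm(x, 4, w, b, eps=1e-6)
assert torch.allclose(group_norm_reference(x, 4, w, b), expected, atol=1e-5)
```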
{liger_kernel_nightly-0.4.0.dev20241107202516 → liger_kernel_nightly-0.4.0.dev20241108173943}/README.md

@@ -273,6 +273,7 @@ loss.backward()
 
 - **RMSNorm**: [RMSNorm](https://arxiv.org/pdf/1910.07467), which normalizes activations using their root mean square, is implemented by fusing the normalization and scaling steps into a single Triton kernel, and achieves ~3X speedup with ~3X peak memory reduction.
 - **LayerNorm**: [LayerNorm](https://arxiv.org/pdf/1607.06450), which centers and normalizes activations across the feature dimension, is implemented by fusing the centering, normalization and scaling steps into a single Triton kernel, and achieves ~2X speedup.
+- **GroupNorm**: [GroupNorm](https://arxiv.org/pdf/1803.08494), which normalizes activations across the group dimension for a given sample. Channels are grouped in K groups over which the normalization is performed, is implemented by fusing the centering, normalization and scaling steps into a single Triton kernel, and can achieve up to ~2X speedup as the number of channels/groups increases.
 - **RoPE**: [Rotary Positional Embedding](https://arxiv.org/pdf/2104.09864) is implemented by fusing the query and key embedding rotary into a single kernel with inplace replacement, and achieves ~3X speedup with ~3X peak memory reduction.
 - **SwiGLU**: [Swish Gated Linear Units](https://arxiv.org/pdf/2002.05202), given by
 $$\text{SwiGLU}(x)=\text{Swish}_{\beta}(xW+b)\otimes(xV+c)$$
@@ -285,7 +286,7 @@ $$\text{GeGLU}(x)=\text{GELU}(xW+b)\otimes(xV+c)$$
 - **FusedLinearCrossEntropy**: Peak memory usage of cross entropy loss is further improved by fusing the model head with the CE loss and chunking the input for block-wise loss and gradient calculation, a technique inspired by [Efficient Cross Entropy](https://github.com/mgmalek/efficient_cross_entropy). It achieves >4X memory reduction for 128k vocab size. **This is highly effective for large batch size, large sequence length, and large vocabulary sizes.** Please refer to the [Medusa example](https://github.com/linkedin/Liger-Kernel/tree/main/examples/medusa) for individual kernel usage.
 - **KLDivergence**: [KL Divergence](https://pytorch.org/docs/stable/generated/torch.nn.KLDivLoss.html) is implemented by fusing the forward into a single triton kernel, with reduction done outside the kernel. It achieves ~1.5X speed and ~15% memory reduction for 128K vocab size.
 - **JSD**: [Generalized JSD](https://arxiv.org/pdf/2306.13649) (Jensen-Shannon divergence), is implemented by computing both the loss and gradient in the forward pass. It achieves ~1.5X speed and ~54% memory reduction for 128k vocab size.
-- **FusedLinearJSD**: Peak memory usage of JSD loss is further improved by fusing the model head with the
+- **FusedLinearJSD**: Peak memory usage of JSD loss is further improved by fusing the model head with the JSD and chunking the input for block-wise loss and gradient calculation. It achieves ~85% memory reduction for 128k vocab size where batch size $\times$ sequence length is 8192.
 
 
 ### Experimental Kernels
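The FusedLinearCrossEntropy and FusedLinearJSD entries above rely on the same idea: fuse the model head with the loss and walk through the flattened hidden states in chunks, so the full (batch·seq, vocab) logits tensor is never materialized. Below is a rough, framework-level sketch of that chunking pattern for cross-entropy in plain PyTorch, with the chunk gradients computed by hand; it is not Liger's Triton implementation, and the chunk size and shapes are illustrative.

```python
import torch

@torch.no_grad()
def chunked_linear_ce(hidden, lm_head, targets, chunk_size=1024):
    """Cross-entropy over a linear head, one block of logits at a time.

    hidden:  (N, H) flattened hidden states, N = batch_size * seq_len
    lm_head: (V, H) output-projection weight
    targets: (N,)   target token ids
    Returns (loss, grad_hidden, grad_lm_head). Only a (chunk_size, V)
    logits block is materialized at any point, which is where the memory
    saving of the fused kernels comes from.
    """
    n = hidden.shape[0]
    loss = hidden.new_zeros(())
    grad_hidden = torch.zeros_like(hidden)
    grad_lm_head = torch.zeros_like(lm_head)
    for s in range(0, n, chunk_size):
        h, t = hidden[s:s + chunk_size], targets[s:s + chunk_size]
        logits = h @ lm_head.T                       # (chunk, V) block
        logp = torch.log_softmax(logits, dim=-1)
        idx = torch.arange(len(t))
        loss += -logp[idx, t].sum() / n              # mean NLL contribution
        # d(mean CE)/d(logits) = (softmax - one_hot) / N, computed per chunk
        dlogits = logp.exp()
        dlogits[idx, t] -= 1.0
        dlogits /= n
        grad_hidden[s:s + chunk_size] = dlogits @ lm_head
        grad_lm_head += dlogits.T @ h
    return loss, grad_hidden, grad_lm_head

hidden = torch.randn(4096, 512)
lm_head = torch.randn(32000, 512) * 0.02
targets = torch.randint(0, 32000, (4096,))
loss, d_hidden, d_lm_head = chunked_linear_ce(hidden, lm_head, targets)
```

Because both the loss and its gradients are produced inside each chunk, nothing beyond the current logits block has to be kept around, which is the effect the README describes for the fused kernels.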
{liger_kernel_nightly-0.4.0.dev20241107202516 → liger_kernel_nightly-0.4.0.dev20241108173943}/pyproject.toml

@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 
 [project]
 name = "liger_kernel_nightly"
-version = "0.4.0.dev20241107202516"
+version = "0.4.0.dev20241108173943"
 description = "Efficient Triton kernels for LLM Training"
 urls = { "Homepage" = "https://github.com/linkedin/Liger-Kernel" }
 readme = { file = "README.md", content-type = "text/markdown" }
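Apart from the README and PKG-INFO documentation updates, the packaging change in this nightly is just the version bump above. To confirm which nightly build is installed in an environment, the standard library exposes the distribution version directly; the expected output below simply assumes the new version from this diff.

```python
from importlib.metadata import version

# Distribution name as declared in pyproject.toml.
print(version("liger_kernel_nightly"))
# expected for this release: 0.4.0.dev20241108173943
```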
{liger_kernel_nightly-0.4.0.dev20241107202516 → liger_kernel_nightly-0.4.0.dev20241108173943/src/liger_kernel_nightly.egg-info}/PKG-INFO

Same +3 −2 change as the PKG-INFO diff shown above: the version line is bumped to 0.4.0.dev20241108173943, the **GroupNorm** entry is added to the kernel list, and the **FusedLinearJSD** entry is completed.
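The JSD and FusedLinearJSD entries refer to a generalized (β-interpolated) Jensen-Shannon divergence. As a point of reference for what such a loss looks like, here is a small log-space PyTorch version of the standard β-interpolated JSD used in distillation settings; it is a generic formulation, not necessarily the exact parameterization or reduction used by Liger's kernel.

```python
import math
import torch
import torch.nn.functional as F

def generalized_jsd(student_logits, teacher_logits, beta=0.5):
    """beta-interpolated Jensen-Shannon divergence, computed in log-space.

    JSD_beta(P || Q) = beta * KL(P || M) + (1 - beta) * KL(Q || M),
    with M = beta * P + (1 - beta) * Q and 0 < beta < 1.
    beta = 0.5 gives the classic symmetric JSD.
    Shapes: (N, V) logits; returns the mean over the N rows.
    """
    log_p = F.log_softmax(teacher_logits, dim=-1)   # log P (teacher)
    log_q = F.log_softmax(student_logits, dim=-1)   # log Q (student)
    # log M = log(beta * P + (1 - beta) * Q), via a stable logsumexp
    log_m = torch.logsumexp(
        torch.stack([log_p + math.log(beta), log_q + math.log(1.0 - beta)]),
        dim=0,
    )
    # F.kl_div(input, target, log_target=True) computes KL(target || input)
    kl_pm = F.kl_div(log_m, log_p, log_target=True, reduction="none").sum(-1)
    kl_qm = F.kl_div(log_m, log_q, log_target=True, reduction="none").sum(-1)
    return (beta * kl_pm + (1.0 - beta) * kl_qm).mean()

student = torch.randn(8, 32000)
teacher = torch.randn(8, 32000)
print(generalized_jsd(student, teacher).item())
```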