liger-kernel-nightly 0.4.0.dev20241107194223__tar.gz → 0.4.0.dev20241108173850__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Potentially problematic release: this version of liger-kernel-nightly might be problematic.
- {liger_kernel_nightly-0.4.0.dev20241107194223/src/liger_kernel_nightly.egg-info → liger_kernel_nightly-0.4.0.dev20241108173850}/PKG-INFO +2 -2
- {liger_kernel_nightly-0.4.0.dev20241107194223 → liger_kernel_nightly-0.4.0.dev20241108173850}/README.md +1 -1
- {liger_kernel_nightly-0.4.0.dev20241107194223 → liger_kernel_nightly-0.4.0.dev20241108173850}/pyproject.toml +1 -1
- liger_kernel_nightly-0.4.0.dev20241108173850/src/liger_kernel/ops/group_norm.py +322 -0
- {liger_kernel_nightly-0.4.0.dev20241107194223 → liger_kernel_nightly-0.4.0.dev20241108173850}/src/liger_kernel/transformers/functional.py +2 -0
- liger_kernel_nightly-0.4.0.dev20241108173850/src/liger_kernel/transformers/group_norm.py +56 -0
- {liger_kernel_nightly-0.4.0.dev20241107194223 → liger_kernel_nightly-0.4.0.dev20241108173850/src/liger_kernel_nightly.egg-info}/PKG-INFO +2 -2
- {liger_kernel_nightly-0.4.0.dev20241107194223 → liger_kernel_nightly-0.4.0.dev20241108173850}/src/liger_kernel_nightly.egg-info/SOURCES.txt +2 -0
- {liger_kernel_nightly-0.4.0.dev20241107194223 → liger_kernel_nightly-0.4.0.dev20241108173850}/LICENSE +0 -0
- {liger_kernel_nightly-0.4.0.dev20241107194223 → liger_kernel_nightly-0.4.0.dev20241108173850}/NOTICE +0 -0
- {liger_kernel_nightly-0.4.0.dev20241107194223 → liger_kernel_nightly-0.4.0.dev20241108173850}/setup.cfg +0 -0
- {liger_kernel_nightly-0.4.0.dev20241107194223 → liger_kernel_nightly-0.4.0.dev20241108173850}/src/liger_kernel/env_report.py +0 -0
- {liger_kernel_nightly-0.4.0.dev20241107194223 → liger_kernel_nightly-0.4.0.dev20241108173850}/src/liger_kernel/ops/__init__.py +0 -0
- {liger_kernel_nightly-0.4.0.dev20241107194223 → liger_kernel_nightly-0.4.0.dev20241108173850}/src/liger_kernel/ops/cross_entropy.py +0 -0
- {liger_kernel_nightly-0.4.0.dev20241107194223 → liger_kernel_nightly-0.4.0.dev20241108173850}/src/liger_kernel/ops/experimental/embedding.py +0 -0
- {liger_kernel_nightly-0.4.0.dev20241107194223 → liger_kernel_nightly-0.4.0.dev20241108173850}/src/liger_kernel/ops/experimental/mm_int8int2.py +0 -0
- {liger_kernel_nightly-0.4.0.dev20241107194223 → liger_kernel_nightly-0.4.0.dev20241108173850}/src/liger_kernel/ops/fused_linear_cross_entropy.py +0 -0
- {liger_kernel_nightly-0.4.0.dev20241107194223 → liger_kernel_nightly-0.4.0.dev20241108173850}/src/liger_kernel/ops/fused_linear_jsd.py +0 -0
- {liger_kernel_nightly-0.4.0.dev20241107194223 → liger_kernel_nightly-0.4.0.dev20241108173850}/src/liger_kernel/ops/geglu.py +0 -0
- {liger_kernel_nightly-0.4.0.dev20241107194223 → liger_kernel_nightly-0.4.0.dev20241108173850}/src/liger_kernel/ops/jsd.py +0 -0
- {liger_kernel_nightly-0.4.0.dev20241107194223 → liger_kernel_nightly-0.4.0.dev20241108173850}/src/liger_kernel/ops/kl_div.py +0 -0
- {liger_kernel_nightly-0.4.0.dev20241107194223 → liger_kernel_nightly-0.4.0.dev20241108173850}/src/liger_kernel/ops/layer_norm.py +0 -0
- {liger_kernel_nightly-0.4.0.dev20241107194223 → liger_kernel_nightly-0.4.0.dev20241108173850}/src/liger_kernel/ops/rms_norm.py +0 -0
- {liger_kernel_nightly-0.4.0.dev20241107194223 → liger_kernel_nightly-0.4.0.dev20241108173850}/src/liger_kernel/ops/rope.py +0 -0
- {liger_kernel_nightly-0.4.0.dev20241107194223 → liger_kernel_nightly-0.4.0.dev20241108173850}/src/liger_kernel/ops/swiglu.py +0 -0
- {liger_kernel_nightly-0.4.0.dev20241107194223 → liger_kernel_nightly-0.4.0.dev20241108173850}/src/liger_kernel/ops/utils.py +0 -0
- {liger_kernel_nightly-0.4.0.dev20241107194223 → liger_kernel_nightly-0.4.0.dev20241108173850}/src/liger_kernel/transformers/__init__.py +0 -0
- {liger_kernel_nightly-0.4.0.dev20241107194223 → liger_kernel_nightly-0.4.0.dev20241108173850}/src/liger_kernel/transformers/auto_model.py +0 -0
- {liger_kernel_nightly-0.4.0.dev20241107194223 → liger_kernel_nightly-0.4.0.dev20241108173850}/src/liger_kernel/transformers/cross_entropy.py +0 -0
- {liger_kernel_nightly-0.4.0.dev20241107194223 → liger_kernel_nightly-0.4.0.dev20241108173850}/src/liger_kernel/transformers/experimental/embedding.py +0 -0
- {liger_kernel_nightly-0.4.0.dev20241107194223 → liger_kernel_nightly-0.4.0.dev20241108173850}/src/liger_kernel/transformers/fused_linear_cross_entropy.py +0 -0
- {liger_kernel_nightly-0.4.0.dev20241107194223 → liger_kernel_nightly-0.4.0.dev20241108173850}/src/liger_kernel/transformers/fused_linear_jsd.py +0 -0
- {liger_kernel_nightly-0.4.0.dev20241107194223 → liger_kernel_nightly-0.4.0.dev20241108173850}/src/liger_kernel/transformers/geglu.py +0 -0
- {liger_kernel_nightly-0.4.0.dev20241107194223 → liger_kernel_nightly-0.4.0.dev20241108173850}/src/liger_kernel/transformers/jsd.py +0 -0
- {liger_kernel_nightly-0.4.0.dev20241107194223 → liger_kernel_nightly-0.4.0.dev20241108173850}/src/liger_kernel/transformers/kl_div.py +0 -0
- {liger_kernel_nightly-0.4.0.dev20241107194223 → liger_kernel_nightly-0.4.0.dev20241108173850}/src/liger_kernel/transformers/layer_norm.py +0 -0
- {liger_kernel_nightly-0.4.0.dev20241107194223 → liger_kernel_nightly-0.4.0.dev20241108173850}/src/liger_kernel/transformers/model/__init__.py +0 -0
- {liger_kernel_nightly-0.4.0.dev20241107194223 → liger_kernel_nightly-0.4.0.dev20241108173850}/src/liger_kernel/transformers/model/gemma.py +0 -0
- {liger_kernel_nightly-0.4.0.dev20241107194223 → liger_kernel_nightly-0.4.0.dev20241108173850}/src/liger_kernel/transformers/model/llama.py +0 -0
- {liger_kernel_nightly-0.4.0.dev20241107194223 → liger_kernel_nightly-0.4.0.dev20241108173850}/src/liger_kernel/transformers/model/mistral.py +0 -0
- {liger_kernel_nightly-0.4.0.dev20241107194223 → liger_kernel_nightly-0.4.0.dev20241108173850}/src/liger_kernel/transformers/model/mixtral.py +0 -0
- {liger_kernel_nightly-0.4.0.dev20241107194223 → liger_kernel_nightly-0.4.0.dev20241108173850}/src/liger_kernel/transformers/model/mllama.py +0 -0
- {liger_kernel_nightly-0.4.0.dev20241107194223 → liger_kernel_nightly-0.4.0.dev20241108173850}/src/liger_kernel/transformers/model/phi3.py +0 -0
- {liger_kernel_nightly-0.4.0.dev20241107194223 → liger_kernel_nightly-0.4.0.dev20241108173850}/src/liger_kernel/transformers/model/qwen2.py +0 -0
- {liger_kernel_nightly-0.4.0.dev20241107194223 → liger_kernel_nightly-0.4.0.dev20241108173850}/src/liger_kernel/transformers/model/qwen2_vl.py +0 -0
- {liger_kernel_nightly-0.4.0.dev20241107194223 → liger_kernel_nightly-0.4.0.dev20241108173850}/src/liger_kernel/transformers/monkey_patch.py +0 -0
- {liger_kernel_nightly-0.4.0.dev20241107194223 → liger_kernel_nightly-0.4.0.dev20241108173850}/src/liger_kernel/transformers/rms_norm.py +0 -0
- {liger_kernel_nightly-0.4.0.dev20241107194223 → liger_kernel_nightly-0.4.0.dev20241108173850}/src/liger_kernel/transformers/rope.py +0 -0
- {liger_kernel_nightly-0.4.0.dev20241107194223 → liger_kernel_nightly-0.4.0.dev20241108173850}/src/liger_kernel/transformers/swiglu.py +0 -0
- {liger_kernel_nightly-0.4.0.dev20241107194223 → liger_kernel_nightly-0.4.0.dev20241108173850}/src/liger_kernel/transformers/trainer_integration.py +0 -0
- {liger_kernel_nightly-0.4.0.dev20241107194223 → liger_kernel_nightly-0.4.0.dev20241108173850}/src/liger_kernel/triton/__init__.py +0 -0
- {liger_kernel_nightly-0.4.0.dev20241107194223 → liger_kernel_nightly-0.4.0.dev20241108173850}/src/liger_kernel/triton/monkey_patch.py +0 -0
- {liger_kernel_nightly-0.4.0.dev20241107194223 → liger_kernel_nightly-0.4.0.dev20241108173850}/src/liger_kernel_nightly.egg-info/dependency_links.txt +0 -0
- {liger_kernel_nightly-0.4.0.dev20241107194223 → liger_kernel_nightly-0.4.0.dev20241108173850}/src/liger_kernel_nightly.egg-info/requires.txt +0 -0
- {liger_kernel_nightly-0.4.0.dev20241107194223 → liger_kernel_nightly-0.4.0.dev20241108173850}/src/liger_kernel_nightly.egg-info/top_level.txt +0 -0
{liger_kernel_nightly-0.4.0.dev20241107194223/src/liger_kernel_nightly.egg-info → liger_kernel_nightly-0.4.0.dev20241108173850}/PKG-INFO
@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: liger_kernel_nightly
-Version: 0.4.0.dev20241107194223
+Version: 0.4.0.dev20241108173850
 Summary: Efficient Triton kernels for LLM Training
 License: BSD 2-CLAUSE LICENSE
 Copyright 2024 LinkedIn Corporation
@@ -332,7 +332,7 @@ $$\text{GeGLU}(x)=\text{GELU}(xW+b)\otimes(xV+c)$$
 - **FusedLinearCrossEntropy**: Peak memory usage of cross entropy loss is further improved by fusing the model head with the CE loss and chunking the input for block-wise loss and gradient calculation, a technique inspired by [Efficient Cross Entropy](https://github.com/mgmalek/efficient_cross_entropy). It achieves >4X memory reduction for 128k vocab size. **This is highly effective for large batch size, large sequence length, and large vocabulary sizes.** Please refer to the [Medusa example](https://github.com/linkedin/Liger-Kernel/tree/main/examples/medusa) for individual kernel usage.
 - **KLDivergence**: [KL Divergence](https://pytorch.org/docs/stable/generated/torch.nn.KLDivLoss.html) is implemented by fusing the forward into a single triton kernel, with reduction done outside the kernel. It achieves ~1.5X speed and ~15% memory reduction for 128K vocab size.
 - **JSD**: [Generalized JSD](https://arxiv.org/pdf/2306.13649) (Jensen-Shannon divergence), is implemented by computing both the loss and gradient in the forward pass. It achieves ~1.5X speed and ~54% memory reduction for 128k vocab size.
-- **FusedLinearJSD**: Peak memory usage of JSD loss is further improved by fusing the model head with the
+- **FusedLinearJSD**: Peak memory usage of JSD loss is further improved by fusing the model head with the JSD and chunking the input for block-wise loss and gradient calculation. It achieves ~85% memory reduction for 128k vocab size where batch size $\times$ sequence length is 8192.


 ### Experimental Kernels
{liger_kernel_nightly-0.4.0.dev20241107194223 → liger_kernel_nightly-0.4.0.dev20241108173850}/README.md
@@ -285,7 +285,7 @@ $$\text{GeGLU}(x)=\text{GELU}(xW+b)\otimes(xV+c)$$
 - **FusedLinearCrossEntropy**: Peak memory usage of cross entropy loss is further improved by fusing the model head with the CE loss and chunking the input for block-wise loss and gradient calculation, a technique inspired by [Efficient Cross Entropy](https://github.com/mgmalek/efficient_cross_entropy). It achieves >4X memory reduction for 128k vocab size. **This is highly effective for large batch size, large sequence length, and large vocabulary sizes.** Please refer to the [Medusa example](https://github.com/linkedin/Liger-Kernel/tree/main/examples/medusa) for individual kernel usage.
 - **KLDivergence**: [KL Divergence](https://pytorch.org/docs/stable/generated/torch.nn.KLDivLoss.html) is implemented by fusing the forward into a single triton kernel, with reduction done outside the kernel. It achieves ~1.5X speed and ~15% memory reduction for 128K vocab size.
 - **JSD**: [Generalized JSD](https://arxiv.org/pdf/2306.13649) (Jensen-Shannon divergence), is implemented by computing both the loss and gradient in the forward pass. It achieves ~1.5X speed and ~54% memory reduction for 128k vocab size.
-- **FusedLinearJSD**: Peak memory usage of JSD loss is further improved by fusing the model head with the
+- **FusedLinearJSD**: Peak memory usage of JSD loss is further improved by fusing the model head with the JSD and chunking the input for block-wise loss and gradient calculation. It achieves ~85% memory reduction for 128k vocab size where batch size $\times$ sequence length is 8192.


 ### Experimental Kernels
{liger_kernel_nightly-0.4.0.dev20241107194223 → liger_kernel_nightly-0.4.0.dev20241108173850}/pyproject.toml
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"

 [project]
 name = "liger_kernel_nightly"
-version = "0.4.0.dev20241107194223"
+version = "0.4.0.dev20241108173850"
 description = "Efficient Triton kernels for LLM Training"
 urls = { "Homepage" = "https://github.com/linkedin/Liger-Kernel" }
 readme = { file = "README.md", content-type = "text/markdown" }
liger_kernel_nightly-0.4.0.dev20241108173850/src/liger_kernel/ops/group_norm.py
@@ -0,0 +1,322 @@
+import operator
+
+import torch
+import triton
+import triton.language as tl
+
+from liger_kernel.ops.utils import compare_version, ensure_contiguous
+
+if compare_version("triton", operator.ge, "3.0.0"):
+    try:
+        # typical import path with dispatch available
+        from triton.language.extra.libdevice import rsqrt
+    except ModuleNotFoundError:
+        # for working with NGC containers
+        from triton.language.extra.cuda.libdevice import rsqrt
+else:
+    from triton.language.math import rsqrt
+
+MAX_FUSED_SIZE = 65536
+
+
+@triton.jit
+def _group_norm_forward_kernel(
+    Y_ptr,  # pointer to output, shape (n_rows, n_groups, hidden_size)
+    Y_row_stride,  # stride of each row in output
+    Y_col_stride,  # stride of each column in output
+    X_ptr,  # pointer to input, shape (n_rows, n_groups, hidden_size)
+    X_row_stride,  # stride of each row in input
+    X_col_stride,  # stride of each column in input
+    Mean_ptr,  # pointer to mean, shape (n_rows, n_groups)
+    Mean_row_stride,  # stride of each row in mean
+    Mean_col_stride,  # stride of each column in mean
+    RSTD_ptr,  # pointer to rstd, shape (n_rows, n_groups)
+    RSTD_row_stride,  # stride of each row in rstd
+    RSTD_col_stride,  # stride of each column in rstd
+    W_ptr,  # pointer to W
+    B_ptr,  # pointer to B
+    hidden_size,  # hidden size of X
+    channels_per_group,  # the number of channels per group
+    eps,
+    BLOCK_SIZE: tl.constexpr,
+):
+    """
+    References:
+    https://nn.labml.ai/normalization/group_norm/index.html
+    """
+    batch_idx = tl.program_id(0)
+    group_idx = tl.program_id(1)
+
+    X_ptr += batch_idx * X_row_stride + group_idx * X_col_stride
+    Y_ptr += batch_idx * Y_row_stride + group_idx * Y_col_stride
+
+    block_range = tl.arange(0, BLOCK_SIZE)
+
+    # Compute mean and variance using the online algorithm
+    s = 0.0
+    squared_sum = 0.0
+    for i in tl.range(0, hidden_size, BLOCK_SIZE):
+        hidden_size_offsets = i + block_range
+        mask = hidden_size_offsets < hidden_size
+        X = tl.load(X_ptr + hidden_size_offsets, mask=mask, other=0.0)
+        s += tl.sum(X)
+        # X**2
+        squared_sum += tl.sum(X * X)
+
+    m = s / hidden_size
+
+    # variance = E[X**2] - E[X]**2
+    variance = (squared_sum / hidden_size) - (m * m)
+
+    # 1/std
+    rstd = rsqrt(variance + eps)
+
+    # Normalize
+    hidden_size_per_channel = hidden_size // channels_per_group
+    for channel_idx in tl.range(
+        group_idx * channels_per_group, (group_idx + 1) * channels_per_group
+    ):
+        W = tl.load(W_ptr + channel_idx)
+        B = tl.load(B_ptr + channel_idx)
+        for i in range(0, hidden_size_per_channel, BLOCK_SIZE):
+            hidden_size_offsets = i + block_range
+            mask = hidden_size_offsets < hidden_size_per_channel
+            X = tl.load(X_ptr + hidden_size_offsets, mask=mask, other=m)
+            Y = (X - m) * rstd * W + B
+            tl.store(Y_ptr + hidden_size_offsets, Y, mask=mask)
+
+        X_ptr += hidden_size_per_channel
+        Y_ptr += hidden_size_per_channel
+
+    tl.store(Mean_ptr + batch_idx * Mean_row_stride + group_idx * Mean_col_stride, m)
+    tl.store(RSTD_ptr + batch_idx * RSTD_row_stride + group_idx * RSTD_col_stride, rstd)
+
+
+@triton.jit
+def _group_norm_backward_kernel(
+    X_ptr,  # pointer to input, shape (n_rows, n_channels, hidden_size)
+    X_row_stride,  # stride of each row in input
+    X_col_stride,  # stride of each column in input
+    W_ptr,  # pointer to weights, shape (n_channels)
+    Mean_ptr,  # pointer to mean, shape (n_rows, n_groups)
+    Mean_ptr_row_stride,  # stride of each column in mean
+    Mean_ptr_col_stride,  # stride of each column in mean
+    RSTD_ptr,  # pointer to rstd, shape (n_rows, n_groups)
+    DX_ptr,  # pointer to input grad, shape (n_rows, n_groups, hidden_size)
+    DW_ptr,  # pointer to weights grad, shape (n_channels)
+    DB_ptr,  # pointer to bias grad, shape (n_channels)
+    UPSTREAM_ptr,  # pointer to output grad, shape (n_rows, n_channels, hidden_size)
+    hidden_size: tl.constexpr,  # hidden size
+    channels_per_group: tl.constexpr,  # number of groups in group norm
+    BLOCK_SIZE: tl.constexpr,
+    dtype: tl.constexpr,
+):
+    """
+    References:
+    https://nn.labml.ai/normalization/group_norm/index.html
+    https://github.com/karpathy/llm.c/blob/master/doc/layernorm/layernorm.md
+
+    The backprop equations are the same for group_norm and layer_norm
+    the only difference here is that we load the Mean, Rstd corresponding to the
+    group we're computing gradients for and the mean and rstd are computed over n-channels
+    so the total number of elements we compute the mean over is num_channels_per_group * hidden_size
+
+    We also need to load the Weights corresponding to the current channel to compute the gradients.
+    """
+    batch_idx = tl.program_id(0)
+    group_idx = tl.program_id(1)
+
+    # Move the pointers to the correct batch
+    X_ptr += batch_idx * X_row_stride
+    DX_ptr += batch_idx * X_row_stride
+    UPSTREAM_ptr += batch_idx * X_row_stride
+
+    # Mean and rstd are the same shape so have the same strides
+    mean = tl.load(
+        Mean_ptr + batch_idx * Mean_ptr_row_stride + group_idx * Mean_ptr_col_stride
+    )
+    rstd = tl.load(
+        RSTD_ptr + batch_idx * Mean_ptr_row_stride + group_idx * Mean_ptr_col_stride
+    )
+
+    c1 = 0.0
+    c2 = 0.0
+    block_range = tl.arange(0, BLOCK_SIZE)
+
+    # We need to compute the sum terms of the backprop equations across all channels in the group
+    for channel_idx in range(
+        group_idx * channels_per_group, (group_idx + 1) * channels_per_group
+    ):
+        dW = 0.0
+        dB = 0.0
+        # Move the pointers to the correct channel
+        W = tl.load(W_ptr + channel_idx)
+        for i in tl.range(0, hidden_size, BLOCK_SIZE):
+            hidden_size_offsets = i + block_range
+            mask = hidden_size_offsets < hidden_size
+            X = tl.load(
+                X_ptr + channel_idx * X_col_stride + hidden_size_offsets,
+                mask=mask,
+                other=0.0,
+            )
+            UPSTREAM_grad = tl.load(
+                UPSTREAM_ptr + channel_idx * X_col_stride + hidden_size_offsets,
+                mask=mask,
+                other=0.0,
+            )
+
+            x_hat = (X - mean) * rstd
+            dW += tl.sum(UPSTREAM_grad * x_hat)
+            dB += tl.sum(UPSTREAM_grad)
+
+            wdy = W * UPSTREAM_grad
+            c1 += tl.sum(x_hat * wdy)
+            c2 += tl.sum(wdy)
+
+        # Need to ensure additions to the same channel are atomic
+        tl.atomic_add(DW_ptr + channel_idx, dW.to(dtype))
+        tl.atomic_add(DB_ptr + channel_idx, dB.to(dtype))
+
+    N = hidden_size * channels_per_group
+    c1 = c1 / N
+    c2 = c2 / N
+
+    for channel_idx in tl.range(
+        group_idx * channels_per_group, (group_idx + 1) * channels_per_group
+    ):
+        # Move the pointers to the correct channel
+        W = tl.load(W_ptr + channel_idx)
+        for i in range(0, hidden_size, BLOCK_SIZE):
+            hidden_size_offsets = i + block_range
+            mask = hidden_size_offsets < hidden_size
+            X = tl.load(
+                X_ptr + channel_idx * X_col_stride + hidden_size_offsets,
+                mask=mask,
+                other=0.0,
+            )
+            UPSTREAM_grad = tl.load(
+                UPSTREAM_ptr + channel_idx * X_col_stride + hidden_size_offsets,
+                mask=mask,
+                other=0.0,
+            )
+
+            x_hat = (X - mean) * rstd
+            wdy = W * UPSTREAM_grad
+            dx = (wdy - (x_hat * c1 + c2)) * rstd
+            tl.store(
+                DX_ptr + channel_idx * X_col_stride + hidden_size_offsets, dx, mask=mask
+            )
+
+
+def group_norm_forward(X, num_channels, num_groups, W, B, eps):
+    shape = X.shape
+    batch_size = shape[0]
+    channels_per_group = num_channels // num_groups
+    # Reshape X so that the mean and std are computed across the groups
+    X = X.view(batch_size, num_groups, -1).contiguous()
+    hidden_size = X.shape[-1]
+    BLOCK_SIZE = min(MAX_FUSED_SIZE, triton.next_power_of_2(hidden_size))
+    Y = torch.empty(
+        (batch_size, num_groups, hidden_size), dtype=X.dtype, device=X.device
+    )
+    Mean = torch.zeros((batch_size, num_groups), dtype=X.dtype, device=X.device)
+    RSTD = torch.zeros((batch_size, num_groups), dtype=X.dtype, device=X.device)
+
+    _group_norm_forward_kernel[(batch_size, num_groups)](
+        Y,
+        Y.stride(0),
+        Y.stride(1),
+        X,
+        X.stride(0),
+        X.stride(1),
+        Mean,
+        Mean.stride(0),
+        Mean.stride(1),
+        RSTD,
+        RSTD.stride(0),
+        RSTD.stride(1),
+        W,
+        B,
+        hidden_size,
+        channels_per_group,
+        eps,
+        BLOCK_SIZE=BLOCK_SIZE,
+    )
+    # Return tensors in the original shape
+    return Y.view(*shape), X.view(*shape), Mean, RSTD, BLOCK_SIZE
+
+
+def group_norm_backward(dY, X, W, B, Mean, RSTD, num_channels, num_groups):
+    shape = dY.shape
+    batch_size = shape[0]
+    hidden_size = dY.shape[-1]
+    channels_per_group = num_channels // num_groups
+    dY = dY.view(batch_size, num_groups, -1)
+    DX = torch.empty(
+        (batch_size, num_groups, hidden_size * channels_per_group),
+        dtype=X.dtype,
+        device=X.device,
+    )
+    DW = torch.zeros((num_channels), dtype=W.dtype, device=W.device)
+    DB = torch.zeros((num_channels), dtype=B.dtype, device=B.device)
+    triton_dtype = tl.float32 if X.dtype == torch.float32 else tl.bfloat16
+
+    BLOCK_SIZE = min(MAX_FUSED_SIZE, triton.next_power_of_2(hidden_size))
+    _group_norm_backward_kernel[(batch_size, num_groups)](
+        X,
+        X.stride(0),
+        X.stride(1),
+        W,
+        Mean,
+        Mean.stride(0),
+        Mean.stride(1),
+        RSTD,
+        DX,
+        DW,
+        DB,
+        dY,
+        hidden_size,
+        channels_per_group,
+        BLOCK_SIZE=BLOCK_SIZE,
+        dtype=triton_dtype,
+    )
+
+    # Return tensors in the original shape
+    return DX.view(*shape), DW, DB
+
+
+class LigerGroupNormFunction(torch.autograd.Function):
+    @staticmethod
+    @ensure_contiguous
+    def forward(
+        ctx,
+        X,
+        affine_scaling_weight,
+        affine_shifting_bias,
+        num_channels,
+        num_groups,
+        eps,
+    ):
+        Y, X, Mean, RSTD, BLOCK_SIZE = group_norm_forward(
+            X,
+            num_channels,
+            num_groups,
+            affine_scaling_weight,
+            affine_shifting_bias,
+            eps,
+        )
+        ctx.num_channels = num_channels
+        ctx.num_groups = num_groups
+        ctx.save_for_backward(
+            X, affine_scaling_weight, affine_shifting_bias, Mean, RSTD
+        )
+        return Y
+
+    @staticmethod
+    @ensure_contiguous
+    def backward(ctx, dY):
+        X, W, B, Mean, RSTD = ctx.saved_tensors
+        DX, DW, DB = group_norm_backward(
+            dY, X, W, B, Mean, RSTD, ctx.num_channels, ctx.num_groups
+        )
+        return DX, DW, DB, None, None, None
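The new kernel can be exercised directly through the autograd function added above. The following is a minimal numerical sanity check, not part of the release: it assumes a CUDA device with Triton installed, and the shapes, eps, and tolerances are illustrative only. The argument order follows the forward() signature in the diff, (X, affine_scaling_weight, affine_shifting_bias, num_channels, num_groups, eps), and the reference is torch.nn.functional.group_norm.

import torch
import torch.nn.functional as F

from liger_kernel.ops.group_norm import LigerGroupNormFunction

# Illustrative shapes: (batch, channels, hidden), split into 2 groups of 4 channels.
batch, channels, groups, hidden = 4, 8, 2, 128
x = torch.randn(batch, channels, hidden, device="cuda", requires_grad=True)
w = torch.randn(channels, device="cuda", requires_grad=True)
b = torch.randn(channels, device="cuda", requires_grad=True)

# Forward pass of the Triton kernel vs. the PyTorch reference.
y_liger = LigerGroupNormFunction.apply(x, w, b, channels, groups, 1e-6)
y_torch = F.group_norm(x, groups, weight=w, bias=b, eps=1e-6)
print(torch.allclose(y_liger, y_torch, atol=1e-4, rtol=1e-4))

# Backward pass: gradients are produced for the input, weight, and bias.
y_liger.sum().backward()
print(x.grad.shape, w.grad.shape, b.grad.shape)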
{liger_kernel_nightly-0.4.0.dev20241107194223 → liger_kernel_nightly-0.4.0.dev20241108173850}/src/liger_kernel/transformers/functional.py
@@ -4,6 +4,7 @@ from liger_kernel.ops.fused_linear_cross_entropy import (
 )
 from liger_kernel.ops.fused_linear_jsd import LigerFusedLinearJSDFunction
 from liger_kernel.ops.geglu import LigerGELUMulFunction
+from liger_kernel.ops.group_norm import LigerGroupNormFunction
 from liger_kernel.ops.jsd import LigerJSDFunction
 from liger_kernel.ops.kl_div import LigerKLDivLossFunction
 from liger_kernel.ops.layer_norm import LigerLayerNormFunction
@@ -21,3 +22,4 @@ liger_layer_norm = LigerLayerNormFunction.apply
 liger_kl_div = LigerKLDivLossFunction.apply
 liger_jsd = LigerJSDFunction.apply
 liger_fused_linear_jsd = LigerFusedLinearJSDFunction.apply
+liger_group_norm = LigerGroupNormFunction.apply
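With the change above, the op is also exposed as liger_group_norm in liger_kernel.transformers.functional, mirroring the other functional aliases in that module. A short sketch of calling it, assuming a CUDA device with Triton available; the shapes and eps below are illustrative:

import torch

from liger_kernel.transformers.functional import liger_group_norm

x = torch.randn(2, 4, 64, device="cuda")
weight = torch.ones(4, device="cuda")
bias = torch.zeros(4, device="cuda")

# Same positional arguments as LigerGroupNormFunction.apply:
# (X, W, B, num_channels, num_groups, eps)
out = liger_group_norm(x, weight, bias, 4, 2, 1e-6)
print(out.shape)  # torch.Size([2, 4, 64])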
liger_kernel_nightly-0.4.0.dev20241108173850/src/liger_kernel/transformers/group_norm.py
@@ -0,0 +1,56 @@
+import torch
+import torch.nn as nn
+
+from liger_kernel.ops.group_norm import LigerGroupNormFunction
+
+
+class LigerGroupNorm(nn.Module):
+    def __init__(self, num_channels, num_groups, eps=1e-6, bias=False, init_fn="ones"):
+        """
+        A Group Normalization layer.
+        Args:
+            num_channels (int): Number of channels in the input tensor.
+            num_groups (int): Number of groups to divide the channels into.
+            eps (float, optional): A value added to the denominator for numerical stability. Default: 1e-6.
+            bias (bool, optional): If ``True``, adds a learnable bias to the output. Default: ``False``.
+            init_fn (str, optional): Initialization function for the learnable parameters. Default: "ones".
+        """
+        super().__init__()
+        assert init_fn in [
+            "ones",
+            "zeros",
+        ], f"init_fn must be either 'ones' or 'zeros', got {init_fn}"
+
+        assert (
+            num_channels % num_groups == 0
+        ), f"Number of channels {num_channels} must be divisible by num_groups {num_groups}"
+        self.num_channels = num_channels
+        self.num_groups = num_groups
+        self.eps = eps
+        self.weight = nn.Parameter(
+            torch.ones(num_channels) if init_fn == "ones" else torch.zeros(num_channels)
+        )
+        self.bias = nn.Parameter(
+            torch.randn(num_channels) if bias else torch.zeros(num_channels)
+        )
+        self.variance_epsilon = eps
+
+    def forward(self, hidden_states):
+        # hidden_states: (batch_size, num_channels, *)
+        assert (
+            hidden_states.dim() >= 3
+        ), f"Input must have atleast 3 dimensions, got {hidden_states.dim()}"
+        assert (
+            hidden_states.size(1) == self.num_channels
+        ), f"Input tensor must have {self.num_channels} channels, got {hidden_states.size(1)}"
+        return LigerGroupNormFunction.apply(
+            hidden_states,
+            self.weight,
+            self.bias,
+            self.num_channels,
+            self.num_groups,
+            self.variance_epsilon,
+        )
+
+    def extra_repr(self):
+        return f"{self.hidden_size}, num_channels={self.num_channels}, num_groups={self.num_groups}, eps={self.eps}"
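For module-level use, the new LigerGroupNorm is a GroupNorm layer backed by the Triton kernels above. A short usage sketch follows, assuming a CUDA device with Triton installed; sizes are illustrative. Note that the constructor takes (num_channels, num_groups), the reverse of torch.nn.GroupNorm(num_groups, num_channels), and the input must have at least three dimensions with channels on dim 1.

import torch

from liger_kernel.transformers.group_norm import LigerGroupNorm

norm = LigerGroupNorm(num_channels=8, num_groups=2, eps=1e-6, bias=True).cuda()
x = torch.randn(4, 8, 256, device="cuda")  # (batch, channels, hidden)
y = norm(x)
print(y.shape)  # torch.Size([4, 8, 256])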
{liger_kernel_nightly-0.4.0.dev20241107194223 → liger_kernel_nightly-0.4.0.dev20241108173850/src/liger_kernel_nightly.egg-info}/PKG-INFO
@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: liger_kernel_nightly
-Version: 0.4.0.dev20241107194223
+Version: 0.4.0.dev20241108173850
 Summary: Efficient Triton kernels for LLM Training
 License: BSD 2-CLAUSE LICENSE
 Copyright 2024 LinkedIn Corporation
@@ -332,7 +332,7 @@ $$\text{GeGLU}(x)=\text{GELU}(xW+b)\otimes(xV+c)$$
 - **FusedLinearCrossEntropy**: Peak memory usage of cross entropy loss is further improved by fusing the model head with the CE loss and chunking the input for block-wise loss and gradient calculation, a technique inspired by [Efficient Cross Entropy](https://github.com/mgmalek/efficient_cross_entropy). It achieves >4X memory reduction for 128k vocab size. **This is highly effective for large batch size, large sequence length, and large vocabulary sizes.** Please refer to the [Medusa example](https://github.com/linkedin/Liger-Kernel/tree/main/examples/medusa) for individual kernel usage.
 - **KLDivergence**: [KL Divergence](https://pytorch.org/docs/stable/generated/torch.nn.KLDivLoss.html) is implemented by fusing the forward into a single triton kernel, with reduction done outside the kernel. It achieves ~1.5X speed and ~15% memory reduction for 128K vocab size.
 - **JSD**: [Generalized JSD](https://arxiv.org/pdf/2306.13649) (Jensen-Shannon divergence), is implemented by computing both the loss and gradient in the forward pass. It achieves ~1.5X speed and ~54% memory reduction for 128k vocab size.
-- **FusedLinearJSD**: Peak memory usage of JSD loss is further improved by fusing the model head with the
+- **FusedLinearJSD**: Peak memory usage of JSD loss is further improved by fusing the model head with the JSD and chunking the input for block-wise loss and gradient calculation. It achieves ~85% memory reduction for 128k vocab size where batch size $\times$ sequence length is 8192.


 ### Experimental Kernels
{liger_kernel_nightly-0.4.0.dev20241107194223 → liger_kernel_nightly-0.4.0.dev20241108173850}/src/liger_kernel_nightly.egg-info/SOURCES.txt
@@ -8,6 +8,7 @@ src/liger_kernel/ops/cross_entropy.py
 src/liger_kernel/ops/fused_linear_cross_entropy.py
 src/liger_kernel/ops/fused_linear_jsd.py
 src/liger_kernel/ops/geglu.py
+src/liger_kernel/ops/group_norm.py
 src/liger_kernel/ops/jsd.py
 src/liger_kernel/ops/kl_div.py
 src/liger_kernel/ops/layer_norm.py
@@ -24,6 +25,7 @@ src/liger_kernel/transformers/functional.py
 src/liger_kernel/transformers/fused_linear_cross_entropy.py
 src/liger_kernel/transformers/fused_linear_jsd.py
 src/liger_kernel/transformers/geglu.py
+src/liger_kernel/transformers/group_norm.py
 src/liger_kernel/transformers/jsd.py
 src/liger_kernel/transformers/kl_div.py
 src/liger_kernel/transformers/layer_norm.py