adafactor8bit 0.2.2 (tar.gz) → 0.2.5 (tar.gz)

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (14)

{adafactor8bit-0.2.2/adafactor8bit.egg-info → adafactor8bit-0.2.5}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: adafactor8bit
-Version: 0.2.2
+Version: 0.2.5
 Summary: 8-bit Adafactor Optimizer with Fused CUDA Kernels
 Home-page: https://github.com/yanfeiwong/adafactor-8bit
 Author: WANG YAN
@@ -53,7 +53,7 @@ An enhanced 8-bit Adafactor optimizer featuring fused CUDA kernels, log-space bl
 - **Log-Space Quantization**: Maps the second moment (variance) to the log2 space before 8-bit quantization. This approach accommodates the long-tail distribution of variances, reducing the risk of small second-moment estimates being truncated to zero and improving overall training stability.
 - **Fused CUDA Kernels**: Combines dequantization, EMA updates, Warp-Shuffle reductions, and requantization into single kernels. It utilizes `float4` vectorization to optimize memory bandwidth usage.
-- **Optional 4-bit Packed First Moment**: Stores the first moment (`beta1`) in a physically packed 4-bit format when enabled, providing momentum with minimal additional memory overhead.
+- **Optional NF4 First Moment**: Stores the optional first moment (`beta1`) using Normal Float 4-bit (NF4) non-uniform quantization, preserving small momentum updates while keeping memory overhead minimal.
 - **CAME Confidence Guidance**: Optional Confidence-guided Adaptive Memory Efficient Optimization (CAME) that estimates update confidence from historical momentum and adaptively suppresses unstable update directions, improving training stability and reducing loss spikes.
 - **APOLLO Subspace Projection**: Opt-in random subspace projection that estimates adaptive gradient scaling in a low-rank space, preventing stale second-moment statistics and potentially improving convergence and generalization.
 - **Fira Norm-Growth Limiter**: Suppresses destructive gradient spikes by regulating the relative increase of update norms. Originally used for the APOLLO path, it is now available for the standard Adafactor path as well. It improves training stability and often allows the safe removal of external gradient clipping.
@@ -278,7 +278,7 @@ Enable the APOLLO path to compute gradient scaling factors in a memory-efficient
 - **`apollo_factorize` (Experimental)**: Applies Adafactor's row/column factorization within the low-rank subspace. Mathematically, this leverages the norm-preserving property of random projections to approximate the variance of the primary dimension, while the secondary dimension's variance is estimated across random bases, introducing inherent noise. This dual-compression mechanism drastically reduces optimizer state overhead. Note that for smaller models, the actual VRAM savings might be marginal, and the introduced noise could impact convergence stability. Use with caution.
 - **Fira Limiter Integration**: The APOLLO path automatically applies the Fira Norm-Growth Limiter to the scaled gradients to prevent sudden gradient rises from causing loss spikes. You can adjust its sensitivity using the global `fira_margin` parameter.
-## 🛡️ CAME Confidence-Guided Updates
+## 🧊 CAME Confidence-Guided Updates
 Enable the CAME (Confidence-guided Adaptive Memory Efficient Optimization) path to add a confidence estimation stage after momentum accumulation:
@@ -326,24 +326,25 @@ If you are migrating from optimizers like AdamW, Adafactor's learning rate behav
 *These are safe starting points. Always validate on your own task and batch size.*
 ## 🎓 Acknowledgements
-Thanks to **Noam Shazeer** and **Mitchell Stern** for proposing the original Adafactor algorithm in the paper [Adafactor: Adaptive Learning Rates with Sublinear Memory Cost](https://arxiv.org/abs/1804.04235).
-Thanks to **Tim Dettmers** for the inspiration from the paper [8-BIT OPTIMIZERS VIA BLOCK-WISE QUANTIZATION](https://arxiv.org/abs/2110.02861) and the [bitsandbytes](https://github.com/bitsandbytes-foundation/bitsandbytes) library.
-Thanks to **Hanqing Zhu**, **Zhenyu Zhang**, and the team for proposing the approximated gradient scaling method in the paper [APOLLO: SGD-Like Memory, AdamW-level Performance](https://arxiv.org/abs/2412.05270).
+This project builds upon the foundational work of several researchers and open-source communities. Sincere thanks to the following for their invaluable contributions:
-Thanks to **Xi Chen**, **Kaituo Feng**, and the team for the Norm-Growth Limiter mechanism introduced in [Fira: Can We Achieve Full-rank Training of LLMs Under Low-rank Constraint?](https://arxiv.org/abs/2410.01623).
+### Core Algorithm & Optimizer Design
+- **Noam Shazeer & Mitchell Stern** for proposing the original **Adafactor** algorithm ([Adafactor: Adaptive Learning Rates with Sublinear Memory Cost](https://arxiv.org/abs/1804.04235)).
+- **Tim Dettmers** for the inspiration from **8-bit block-wise quantization** ([8-BIT OPTIMIZERS VIA BLOCK-WISE QUANTIZATION](https://arxiv.org/abs/2110.02861)) and the [bitsandbytes](https://github.com/bitsandbytes-foundation/bitsandbytes) library.
+- **Hanqing Zhu, Zhenyu Zhang, et al.** for the **APOLLO** algorithm ([APOLLO: SGD-Like Memory, AdamW-level Performance](https://arxiv.org/abs/2412.05270)).
+- **Xi Chen, Kaituo Feng, et al.** for the **Norm-Growth Limiter** mechanism in **Fira** ([Fira: Can We Achieve Full-rank Training of LLMs Under Low-rank Constraint?](https://arxiv.org/abs/2410.01623)).
+- **Yang Luo, et al.** for the **confidence-guided strategy** in **CAME** ([CAME: Confidence-guided Adaptive Memory Efficient Optimization](https://arxiv.org/abs/2307.02047)).
-Thanks to **Yang Luo** and the team for proposing the confidence-guided strategy in the paper [CAME: Confidence-guided Adaptive Memory Efficient Optimization](https://arxiv.org/abs/2307.02047).
+### Quantization & Implementation
+- **The QLoRA Team** for pioneering the **4-bit NormalFloat (NF4)** quantization format ([QLoRA: Efficient Finetuning of Quantized LLMs](https://arxiv.org/abs/2305.14314)) that inspired our first moment quantization.
+- **The PyTorch AO Team** for their work on [4-bit optimizer states](https://github.com/pytorch/ao/tree/main/torchao/optim), validating distribution-aware quantization for optimizer moments.
+- **The PyTorch Team** for providing the foundational optimizer implementation and the C++ Extension toolchain.
-Thanks to the **PyTorch team** for providing the foundational Optimizer implementation and the C++ Extension toolchain.
+### Technical Review & Discussion
+- **Qwen, ChatGLM, and DeepSeek** (large language models) for valuable technical discussions and code reviews on CUDA low-level optimization, memory safety mechanisms, and cross-platform compilation pipeline design.
-Thanks to the large language models **Qwen**, **ChatGLM** and **DeepSeek** for valuable technical discussions and code reviews on CUDA low-level optimization and memory safety mechanisms.
 ## 🏛️ License
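
The Log-Space Quantization bullet in this README can be illustrated with a small pure-Python sketch. It is not the package's implementation: the helper names (`log_quantize`, `log_dequantize`, `linear_quantize`) are hypothetical, and the encode step is an assumption inferred from the dequantization formula (`q * INV_255 * scale + MIN_LOG`) and the `inv_scale = 255.0f / (max_log - MIN_LOG)` line visible in the `kernels.cu` hunks. It compares how three variances spanning eight orders of magnitude survive 8-bit storage in log2 space versus plain linear scaling.

```python
import math

MIN_LOG = -126.0          # same floor as MIN_LOG in kernels.cu
MIN_VAL = 1.17549435e-38  # same value as MIN_VAL in kernels.cu


def log_quantize(block):
    """Quantize non-negative values to 8-bit codes in log2 space (illustrative)."""
    logs = [math.log2(max(v, MIN_VAL)) for v in block]
    # Clamp the block maximum the same way the 0.2.5 kernel line does.
    max_log = min(max(max(logs), MIN_LOG + 1e-12), 126.0)
    scale = max_log - MIN_LOG
    codes = [min(255, max(0, round((lg - MIN_LOG) * 255.0 / scale))) for lg in logs]
    return codes, scale


def log_dequantize(codes, scale):
    """Invert log_quantize: code -> log2 value -> linear value."""
    return [2.0 ** (q / 255.0 * scale + MIN_LOG) for q in codes]


def linear_quantize(block):
    """Plain linear 8-bit quantization, shown for comparison only."""
    scale = max(block) / 255.0
    return [round(v / scale) for v in block], scale


variances = [1e-8, 1e-4, 1.0]
codes, scale = log_quantize(variances)
restored = log_dequantize(codes, scale)
linear_codes, _ = linear_quantize(variances)
```

Under these assumptions the linear codes for `1e-8` and `1e-4` both collapse to 0, while the log-space round trip keeps every value within about a quarter of a bit in log2 terms.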

{adafactor8bit-0.2.2 → adafactor8bit-0.2.5}/README.md RENAMED

@@ -26,7 +26,7 @@ An enhanced 8-bit Adafactor optimizer featuring fused CUDA kernels, log-space bl
 - **Log-Space Quantization**: Maps the second moment (variance) to the log2 space before 8-bit quantization. This approach accommodates the long-tail distribution of variances, reducing the risk of small second-moment estimates being truncated to zero and improving overall training stability.
 - **Fused CUDA Kernels**: Combines dequantization, EMA updates, Warp-Shuffle reductions, and requantization into single kernels. It utilizes `float4` vectorization to optimize memory bandwidth usage.
-- **Optional 4-bit Packed First Moment**: Stores the first moment (`beta1`) in a physically packed 4-bit format when enabled, providing momentum with minimal additional memory overhead.
+- **Optional NF4 First Moment**: Stores the optional first moment (`beta1`) using Normal Float 4-bit (NF4) non-uniform quantization, preserving small momentum updates while keeping memory overhead minimal.
 - **CAME Confidence Guidance**: Optional Confidence-guided Adaptive Memory Efficient Optimization (CAME) that estimates update confidence from historical momentum and adaptively suppresses unstable update directions, improving training stability and reducing loss spikes.
 - **APOLLO Subspace Projection**: Opt-in random subspace projection that estimates adaptive gradient scaling in a low-rank space, preventing stale second-moment statistics and potentially improving convergence and generalization.
 - **Fira Norm-Growth Limiter**: Suppresses destructive gradient spikes by regulating the relative increase of update norms. Originally used for the APOLLO path, it is now available for the standard Adafactor path as well. It improves training stability and often allows the safe removal of external gradient clipping.
@@ -251,7 +251,7 @@ Enable the APOLLO path to compute gradient scaling factors in a memory-efficient
 - **`apollo_factorize` (Experimental)**: Applies Adafactor's row/column factorization within the low-rank subspace. Mathematically, this leverages the norm-preserving property of random projections to approximate the variance of the primary dimension, while the secondary dimension's variance is estimated across random bases, introducing inherent noise. This dual-compression mechanism drastically reduces optimizer state overhead. Note that for smaller models, the actual VRAM savings might be marginal, and the introduced noise could impact convergence stability. Use with caution.
 - **Fira Limiter Integration**: The APOLLO path automatically applies the Fira Norm-Growth Limiter to the scaled gradients to prevent sudden gradient rises from causing loss spikes. You can adjust its sensitivity using the global `fira_margin` parameter.
-## 🛡️ CAME Confidence-Guided Updates
+## 🧊 CAME Confidence-Guided Updates
 Enable the CAME (Confidence-guided Adaptive Memory Efficient Optimization) path to add a confidence estimation stage after momentum accumulation:
@@ -299,24 +299,25 @@ If you are migrating from optimizers like AdamW, Adafactor's learning rate behav
 *These are safe starting points. Always validate on your own task and batch size.*
 ## 🎓 Acknowledgements
-Thanks to **Noam Shazeer** and **Mitchell Stern** for proposing the original Adafactor algorithm in the paper [Adafactor: Adaptive Learning Rates with Sublinear Memory Cost](https://arxiv.org/abs/1804.04235).
-Thanks to **Tim Dettmers** for the inspiration from the paper [8-BIT OPTIMIZERS VIA BLOCK-WISE QUANTIZATION](https://arxiv.org/abs/2110.02861) and the [bitsandbytes](https://github.com/bitsandbytes-foundation/bitsandbytes) library.
-Thanks to **Hanqing Zhu**, **Zhenyu Zhang**, and the team for proposing the approximated gradient scaling method in the paper [APOLLO: SGD-Like Memory, AdamW-level Performance](https://arxiv.org/abs/2412.05270).
+This project builds upon the foundational work of several researchers and open-source communities. Sincere thanks to the following for their invaluable contributions:
-Thanks to **Xi Chen**, **Kaituo Feng**, and the team for the Norm-Growth Limiter mechanism introduced in [Fira: Can We Achieve Full-rank Training of LLMs Under Low-rank Constraint?](https://arxiv.org/abs/2410.01623).
+### Core Algorithm & Optimizer Design
+- **Noam Shazeer & Mitchell Stern** for proposing the original **Adafactor** algorithm ([Adafactor: Adaptive Learning Rates with Sublinear Memory Cost](https://arxiv.org/abs/1804.04235)).
+- **Tim Dettmers** for the inspiration from **8-bit block-wise quantization** ([8-BIT OPTIMIZERS VIA BLOCK-WISE QUANTIZATION](https://arxiv.org/abs/2110.02861)) and the [bitsandbytes](https://github.com/bitsandbytes-foundation/bitsandbytes) library.
+- **Hanqing Zhu, Zhenyu Zhang, et al.** for the **APOLLO** algorithm ([APOLLO: SGD-Like Memory, AdamW-level Performance](https://arxiv.org/abs/2412.05270)).
+- **Xi Chen, Kaituo Feng, et al.** for the **Norm-Growth Limiter** mechanism in **Fira** ([Fira: Can We Achieve Full-rank Training of LLMs Under Low-rank Constraint?](https://arxiv.org/abs/2410.01623)).
+- **Yang Luo, et al.** for the **confidence-guided strategy** in **CAME** ([CAME: Confidence-guided Adaptive Memory Efficient Optimization](https://arxiv.org/abs/2307.02047)).
-Thanks to **Yang Luo** and the team for proposing the confidence-guided strategy in the paper [CAME: Confidence-guided Adaptive Memory Efficient Optimization](https://arxiv.org/abs/2307.02047).
+### Quantization & Implementation
+- **The QLoRA Team** for pioneering the **4-bit NormalFloat (NF4)** quantization format ([QLoRA: Efficient Finetuning of Quantized LLMs](https://arxiv.org/abs/2305.14314)) that inspired our first moment quantization.
+- **The PyTorch AO Team** for their work on [4-bit optimizer states](https://github.com/pytorch/ao/tree/main/torchao/optim), validating distribution-aware quantization for optimizer moments.
+- **The PyTorch Team** for providing the foundational optimizer implementation and the C++ Extension toolchain.
-Thanks to the **PyTorch team** for providing the foundational Optimizer implementation and the C++ Extension toolchain.
+### Technical Review & Discussion
+- **Qwen, ChatGLM, and DeepSeek** (large language models) for valuable technical discussions and code reviews on CUDA low-level optimization, memory safety mechanisms, and cross-platform compilation pipeline design.
-Thanks to the large language models **Qwen**, **ChatGLM** and **DeepSeek** for valuable technical discussions and code reviews on CUDA low-level optimization and memory safety mechanisms.
 ## 🏛️ License
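
To make the changed "Optional NF4 First Moment" bullet concrete, here is a minimal pure-Python sketch comparing the 0.2.2 uniform 4-bit scheme with the 0.2.5 table lookup on one block. The 16-entry table is copied from the `kernels.cu` hunk of this diff; the function names `nf4_roundtrip` and `uniform4_roundtrip` and the sample block are hypothetical, and the nearest-entry search mirrors the PyTorch fallback rather than the CUDA path.

```python
NF4_TABLE = [
    -1.0, -0.6961928, -0.52507306, -0.3895074,
    -0.27408478, -0.17286907, -0.07958022, 0.0,
    0.07958022, 0.17286907, 0.27408478, 0.3895074,
    0.52507306, 0.6961928, 0.8641379, 1.0,
]


def nf4_roundtrip(block):
    """0.2.5 scheme: scale = abs max, snap to the nearest table entry, dequantize."""
    scale = max(max(abs(v) for v in block), 1e-12)
    codes = [min(range(16), key=lambda i: abs(v / scale - NF4_TABLE[i])) for v in block]
    return [NF4_TABLE[c] * scale for c in codes]


def uniform4_roundtrip(block):
    """0.2.2 scheme: scale = abs max / 7, round to an integer in [-8, 7], dequantize."""
    scale = max(max(abs(v) for v in block), 1e-12) / 7.0
    codes = [max(-8, min(7, round(v / scale))) for v in block]
    return [c * scale for c in codes]


momentum = [1.0, 0.05, -0.05, 0.3]
nf4_values = nf4_roundtrip(momentum)
uniform_values = uniform4_roundtrip(momentum)
```

In this example the uniform scheme rounds `±0.05` to zero, whereas the table lookup keeps them non-zero (snapped to `±0.07958022`). Values smaller than about 0.04 of the block maximum still round to zero under the table as well.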

{adafactor8bit-0.2.2 → adafactor8bit-0.2.5}/adafactor8bit/kernels.cu RENAMED

@@ -7,7 +7,26 @@
 __device__ constexpr float INV_255 = 1.0f / 255.0f;
 __device__ constexpr float MIN_LOG = -126.0f;
 __device__ constexpr float MIN_VAL = 1.17549435e-38f;
-__device__ constexpr float INV_7 = 1.0f / 7.0f;
+__constant__ float NF4_QMAP[16] = {
+    -1.0f, -0.6961928f, -0.52507306f, -0.3895074f,
+    -0.27408478f, -0.17286907f, -0.07958022f, 0.0f,
+    0.07958022f, 0.17286907f, 0.27408478f, 0.3895074f,
+    0.52507306f, 0.6961928f, 0.8641379f, 1.0f
+};
+__device__ __forceinline__ int find_nearest_nf4(float x) {
+    float x_abs = fabsf(x);
+    if (x_abs < 0.0397901f) return 7;
+    if (x_abs < 0.1262246f) return (x >= 0.0f) ? 8 : 6;
+    if (x_abs < 0.2234769f) return (x >= 0.0f) ? 9 : 5;
+    if (x_abs < 0.3317961f) return (x >= 0.0f) ? 10 : 4;
+    if (x_abs < 0.4572902f) return (x >= 0.0f) ? 11 : 3;
+    if (x_abs < 0.6106329f) return (x >= 0.0f) ? 12 : 2;
+    if (x_abs < 0.7801653f) return (x >= 0.0f) ? 13 : 1;
+    if (x_abs < 0.9320689f) return (x >= 0.0f) ? 14 : 0;
+    return (x >= 0.0f) ? 15 : 0;
+}
 // ==========================================
 // 1. Fused Log-Quantize Lerp (EMA Update for V_t)
@@ -120,7 +139,7 @@ __global__ void fused_log_quantize_lerp_kernel(
     }
     __syncthreads();
-    float max_log = fminf(fmaxf(s_max[0], MIN_LOG + 1e-12f), 50.0f);
+    float max_log = fminf(fmaxf(s_max[0], MIN_LOG + 1e-12f), 126.0f);
     float new_scale = max_log - MIN_LOG;
     float inv_scale = 255.0f / (max_log - MIN_LOG);
@@ -226,11 +245,10 @@ __global__ void fused_4bit_quantize_lerp_kernel(
         uchar2 old_q = q_vec[idx];
-        // Unpack old m_t (biased by +8 for unsigned storage)
-        float m_old0 = (float)((old_q.x >> 4) - 8) * old_scale;
-        float m_old1 = (float)((old_q.x & 0x0F) - 8) * old_scale;
-        float m_old2 = (float)((old_q.y >> 4) - 8) * old_scale;
-        float m_old3 = (float)((old_q.y & 0x0F) - 8) * old_scale;
+        float m_old0 = NF4_QMAP[(old_q.x >> 4)] * old_scale;
+        float m_old1 = NF4_QMAP[(old_q.x & 0x0F)] * old_scale;
+        float m_old2 = NF4_QMAP[(old_q.y >> 4)] * old_scale;
+        float m_old3 = NF4_QMAP[(old_q.y & 0x0F)] * old_scale;
         float m_new0 = beta * m_old0 + one_minus_b * val_x;
         float m_new1 = beta * m_old1 + one_minus_b * val_y;
@@ -267,7 +285,7 @@ __global__ void fused_4bit_quantize_lerp_kernel(
     __syncthreads();
     float abs_max = fmaxf(s_max[0], 1e-12f);
-    float new_scale = abs_max * INV_7;
+    float new_scale = abs_max;
     float inv_scale = 1.0f / new_scale;
     for (int i = 0; i < vec_iters; i++) {
@@ -279,12 +297,10 @@ __global__ void fused_4bit_quantize_lerp_kernel(
         float m2 = local_m[idx * 4 + 2];
         float m3 = local_m[idx * 4 + 3];
-        // Pure integer clamping, then bias by +8 for unsigned 4-bit packing
-        // Using standard 4-bit signed range [-8, 7] mapping to [0, 15]
-        int q0 = max(-8, min(7, __float2int_rn(m0 * inv_scale))) + 8;
-        int q1 = max(-8, min(7, __float2int_rn(m1 * inv_scale))) + 8;
-        int q2 = max(-8, min(7, __float2int_rn(m2 * inv_scale))) + 8;
-        int q3 = max(-8, min(7, __float2int_rn(m3 * inv_scale))) + 8;
+        int q0 = find_nearest_nf4(m0 * inv_scale);
+        int q1 = find_nearest_nf4(m1 * inv_scale);
+        int q2 = find_nearest_nf4(m2 * inv_scale);
+        int q3 = find_nearest_nf4(m3 * inv_scale);
         uchar2 out_q;
         out_q.x = (unsigned char)((q0 << 4) | q1);
@@ -349,7 +365,8 @@ __global__ void compute_update_norm_2d_kernel(
         max_log = fmaxf(max_log, -53.0f);
         float inv_std = exp2f(-0.5f * max_log);
-        float u_ij = grad[idx] * inv_std;
+        float g_val = (isnan(grad[idx]) || isinf(grad[idx])) ? 0.0f : grad[idx];
+        float u_ij = g_val * inv_std;
         sq += u_ij * u_ij;
     }
@@ -418,7 +435,8 @@ __global__ void apply_update_2d_kernel(
         max_log = fmaxf(max_log, -53.0f);
         float inv_std = exp2f(-0.5f * max_log);
-        float u_ij = grad[idx] * inv_std;
+        float g_val = (isnan(grad[idx]) || isinf(grad[idx])) ? 0.0f : grad[idx];
+        float u_ij = g_val * inv_std;
         float p_val = static_cast<float>(param[idx]);
         p_val -= step_size * u_ij;
@@ -464,7 +482,8 @@ __global__ void compute_update_norm_1d_kernel(
         max_log = fmaxf(max_log, -53.0f);
         float inv_std = exp2f(-0.5f * max_log);
-        float u_val = grad[idx] * inv_std;
+        float g_val = (isnan(grad[idx]) || isinf(grad[idx])) ? 0.0f : grad[idx];
+        float u_val = g_val * inv_std;
         sq += u_val * u_val;
     }
@@ -516,7 +535,8 @@ __global__ void apply_update_1d_kernel(
         max_log = fmaxf(max_log, -53.0f);
         float inv_std = exp2f(-0.5f * max_log);
-        float u_val = grad[idx] * inv_std;
+        float g_val = (isnan(grad[idx]) || isinf(grad[idx])) ? 0.0f : grad[idx];
+        float u_val = g_val * inv_std;
         float p_val = static_cast<float>(param[idx]);
         p_val -= step_size * u_val;
@@ -561,7 +581,7 @@ __global__ void compute_update_norm_m_2d_kernel(
         // Unpack 4-bit m_t
         unsigned char packed = m_q[idx / 2];
         int q_int = (idx & 1) ? (packed & 0x0F) : (packed >> 4);
-        float m_val = (float)(q_int - 8) * m_scale[idx / m_block_size];
+        float m_val = NF4_QMAP[q_int] * m_scale[idx / m_block_size];
         float log_r = (float)row_var_q[b * R + r] * INV_255 * row_var_scale[(b * R + r) / v_block_size] + MIN_LOG;
         float log_c = (float)col_var_q[b * C + c] * INV_255 * col_var_scale[(b * C + c) / v_block_size] + MIN_LOG;
@@ -636,7 +656,7 @@ __global__ void apply_update_m_2d_kernel(
         // Unpack 4-bit m_t
         unsigned char packed = m_q[idx / 2];
         int q_int = (idx & 1) ? (packed & 0x0F) : (packed >> 4);
-        float m_val = (float)(q_int - 8) * m_scale[idx / m_block_size];
+        float m_val = NF4_QMAP[q_int] * m_scale[idx / m_block_size];
         float log_r = (float)row_var_q[b * R + r] * INV_255 * row_var_scale[(b * R + r) / v_block_size] + MIN_LOG;
         float log_c = (float)col_var_q[b * C + c] * INV_255 * col_var_scale[(b * C + c) / v_block_size] + MIN_LOG;
@@ -692,7 +712,7 @@ __global__ void compute_update_norm_m_1d_kernel(
         // Unpack 4-bit m_t
         unsigned char packed = m_q[idx / 2];
         int q_int = (idx & 1) ? (packed & 0x0F) : (packed >> 4);
-        float m_val = (float)(q_int - 8) * m_scale[idx / m_block_size];
+        float m_val = NF4_QMAP[q_int] * m_scale[idx / m_block_size];
         float log_v = (float)variance_q[idx] * INV_255 * variance_scale[idx / v_block_size] + MIN_LOG;
@@ -751,7 +771,7 @@ __global__ void apply_update_m_1d_kernel(
         // Unpack 4-bit m_t
         unsigned char packed = m_q[idx / 2];
         int q_int = (idx & 1) ? (packed & 0x0F) : (packed >> 4);
-        float m_val = (float)(q_int - 8) * m_scale[idx / m_block_size];
+        float m_val = NF4_QMAP[q_int] * m_scale[idx / m_block_size];
         float log_v = (float)variance_q[idx] * INV_255 * variance_scale[idx / v_block_size] + MIN_LOG;
@@ -808,7 +828,7 @@ __global__ void compute_apollo_norms_kernel(
         // Unpack 4-bit m
         unsigned char m_byte = m_q[global_idx / 2];
         int m_int = (global_idx & 1) ? (m_byte & 0x0F) : (m_byte >> 4);
-        float m_val = ((float)m_int - 8.0f) * m_scale[global_idx / m_block_size];
+        float m_val = NF4_QMAP[m_int] * m_scale[global_idx / m_block_size];
         // Unpack 8-bit log v
         unsigned char v_byte = v_q[global_idx];
         float log_v = (float)v_byte * INV_255 * v_scale[global_idx / v_block_size] + MIN_LOG;
@@ -892,7 +912,7 @@ __global__ void dequantize_4bit_kernel(
     if (idx >= numel) return;
     unsigned char packed = q[idx / 2];
     int q_int = (idx & 1) ? (packed & 0x0F) : (packed >> 4);
-    output[idx] = (float)(q_int - 8) * scale[idx / block_size];
+    output[idx] = NF4_QMAP[q_int] * scale[idx / block_size];
 }
 void dequantize_4bit_cuda(
@@ -922,7 +942,7 @@ __global__ void compute_update_norm_1d_full_kernel(
     float one_minus_b = 1.0f - beta;
     for (int idx = blockIdx.x * blockDim.x + threadIdx.x; idx < numel; idx += stride) {
-        float g = grad[idx];
+        float g = (isnan(grad[idx]) || isinf(grad[idx])) ? 0.0f : grad[idx];
         float g2 = g * g;
         float v = one_minus_b * variance[idx] + beta * g2;
         variance[idx] = v;
@@ -979,7 +999,7 @@ __global__ void apply_update_1d_full_kernel(
     int stride = gridDim.x * blockDim.x;
     for (int idx = blockIdx.x * blockDim.x + threadIdx.x; idx < numel; idx += stride) {
-        float g = grad[idx];
+        float g = (isnan(grad[idx]) || isinf(grad[idx])) ? 0.0f : grad[idx];
         float v = variance[idx];
         float inv_std = rsqrtf(fmaxf(v, eps_sq));
         float u = g * inv_std;
@@ -1023,7 +1043,7 @@ __global__ void compute_update_norm_1d_full_m_kernel(
     float one_minus_bv = 1.0f - beta_val;
     for (int idx = blockIdx.x * blockDim.x + threadIdx.x; idx < numel; idx += stride) {
-        float g = grad[idx];
+        float g = (isnan(grad[idx]) || isinf(grad[idx])) ? 0.0f : grad[idx];
         float g2 = g * g;
         float v = one_minus_bv * variance[idx] + beta_val * g2;
@@ -1132,7 +1152,7 @@ __global__ void came_compute_residual_2d_kernel(
         unsigned char packed = m_q[idx / 2];
         int q_int = (idx & 1) ? (packed & 0x0F) : (packed >> 4);
-        float m_val = (float)(q_int - 8) * m_scale[idx / m_block_size];
+        float m_val = NF4_QMAP[q_int] * m_scale[idx / m_block_size];
         float log_r = (float)row_var_q[b * R + r] * INV_255 * row_var_scale[(b * R + r) / v_block_size] + MIN_LOG;
         float log_c = (float)col_var_q[b * C + c] * INV_255 * col_var_scale[(b * C + c) / v_block_size] + MIN_LOG;
@@ -1143,7 +1163,8 @@ __global__ void came_compute_residual_2d_kernel(
         max_log = fmaxf(max_log, -53.0f);
         float inv_std = exp2f(-0.5f * max_log);
-        float diff = (grad[idx] - m_val) * inv_std;
+        float g_val = (isnan(grad[idx]) || isinf(grad[idx])) ? 0.0f : grad[idx];
+        float diff = (g_val - m_val) * inv_std;
         float res = diff * diff;
         atomicAdd(&res_col_sum[b * C + c], res);
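
The branch thresholds in `find_nearest_nf4` appear to be the midpoints between adjacent non-negative entries of `NF4_QMAP`. The sketch below is a pure-Python transcription of that device function (same constants, same branch order) next to a brute-force nearest-entry search; `nearest_by_search` and `midpoints` are hypothetical names introduced only for this comparison.

```python
NF4_QMAP = [
    -1.0, -0.6961928, -0.52507306, -0.3895074,
    -0.27408478, -0.17286907, -0.07958022, 0.0,
    0.07958022, 0.17286907, 0.27408478, 0.3895074,
    0.52507306, 0.6961928, 0.8641379, 1.0,
]

# Threshold constants copied from find_nearest_nf4 in the hunk above.
THRESHOLDS = [0.0397901, 0.1262246, 0.2234769, 0.3317961,
              0.4572902, 0.6106329, 0.7801653, 0.9320689]


def find_nearest_nf4(x):
    """Transcription of the CUDA device function: threshold on |x|, then pick by sign."""
    x_abs = abs(x)
    if x_abs < THRESHOLDS[0]:
        return 7
    for k in range(1, 8):
        if x_abs < THRESHOLDS[k]:
            return 7 + k if x >= 0.0 else 7 - k
    return 15 if x >= 0.0 else 0


def nearest_by_search(x):
    """Brute-force nearest table entry, as the argmin-based fallback would compute it."""
    return min(range(16), key=lambda i: abs(x - NF4_QMAP[i]))


# Midpoints between adjacent non-negative table entries (codes 7..15).
midpoints = [(NF4_QMAP[i] + NF4_QMAP[i + 1]) / 2 for i in range(7, 15)]

code_threshold = find_nearest_nf4(-0.8)   # code chosen by the threshold branches
code_search = nearest_by_search(-0.8)     # code chosen by exhaustive search
```

One thing this transcription makes visible: the table as shown has no `-0.8641379` entry, so on the negative side the two lookups disagree for inputs between roughly `-0.848` and `-0.780`. For `-0.8` the threshold version returns code 0 (`-1.0`), while the exhaustive search returns code 1 (`-0.6961928`). The diff alone does not show whether this difference matters in practice.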

{adafactor8bit-0.2.2 → adafactor8bit-0.2.5}/adafactor8bit/optimizer.py RENAMED

@@ -21,6 +21,14 @@ _FP32_MIN_LOG = -126.0
 _INV_255 = 1.0 / 255.0
 _INV_7 = 1.0 / 7.0
+_NF4_TABLE = torch.tensor([
+    -1.0, -0.6961928, -0.52507306, -0.3895074,
+    -0.27408478, -0.17286907, -0.07958022, 0.0,
+    0.07958022, 0.17286907, 0.27408478, 0.3895074,
+    0.52507306, 0.6961928, 0.8641379, 1.0
+], dtype=torch.float32)
 # ==========================================
 # 1. CUDA Kernel JIT Loading
 # ==========================================
@@ -127,7 +135,7 @@ def _log_dequantize_nonneg(q: Tensor, scale: Tensor, shape: torch.Size, pad: int
     return flat.view(shape)
 def _quantize_4bit_pytorch(m: Tensor, block_size: int) -> Tuple[Tensor, Tensor]:
-    """4-bit symmetric min-max quantization with physical packing into uint8."""
+    """4-bit NF4 non-uniform quantization with physical packing into uint8."""
     flat = m.flatten()
     pad = (block_size - flat.numel() % block_size) % block_size
     if pad:
@@ -135,12 +143,15 @@ def _quantize_4bit_pytorch(m: Tensor, block_size: int) -> Tuple[Tensor, Tensor]:
     blocks = flat.view(-1, block_size)
     abs_max = blocks.abs().amax(dim=1, keepdim=True).clamp(min=1e-12)
-    scale = abs_max * _INV_7  # _INV_7 = 1.0 / 7.0
+    scale = abs_max
-    q = (torch.round(blocks / scale).clamp(-8, 7) + 8).to(torch.uint8)
+    normalized = blocks / scale
+    table = _NF4_TABLE.to(normalized.device)
+    diff = (normalized.unsqueeze(-1) - table).abs()
+    codes = diff.argmin(dim=-1).to(torch.uint8)
-    q_even = q[:, 0::2]
-    q_odd = q[:, 1::2]
+    q_even = codes[:, 0::2]
+    q_odd = codes[:, 1::2]
     packed = (q_even << 4) | q_odd
     return packed.view(-1), scale.squeeze(-1)
@@ -152,11 +163,12 @@ def _dequantize_4bit(m_q: Tensor, m_scale: Tensor, numel: int, shape: torch.Size
         _CUDA_MODULE.dequantize_4bit(output, m_q, m_scale, numel, block_size)
         return output.view(shape)
     else:
-        high = ((m_q >> 4) & 0x0F).to(torch.float32) - 8.0
-        low = (m_q & 0x0F).to(torch.float32) - 8.0
-        m_flat = torch.stack((high, low), dim=-1).view(-1)
-        m_blocks = m_flat.view(-1, block_size)
-        result = (m_blocks * m_scale.unsqueeze(-1)).view(-1)[:numel]
+        high = (m_q >> 4)
+        low = (m_q & 0x0F)
+        codes = torch.stack((high, low), dim=-1).view(-1)
+        m_blocks = codes.view(-1, block_size)
+        table = _NF4_TABLE.to(m_q.device)
+        result = (table[m_blocks.long()] * m_scale.unsqueeze(-1)).view(-1)[:numel]
         return result.view(shape)
 # ==========================================
@@ -461,20 +473,21 @@ class Adafactor8Bit(Optimizer):
                             if beta1 is not None:
                                 m_padded_numel = ((p.numel() + m_block_size - 1) // m_block_size) * m_block_size
-                                state["m_q"] = torch.full((m_padded_numel // 2,), 0x88, dtype=torch.uint8, device=p.device)
+                                state["m_q"] = torch.full((m_padded_numel // 2,), 0x77, dtype=torch.uint8, device=p.device)
                                 state["m_scale"] = torch.ones(m_padded_numel // m_block_size, dtype=torch.float32, device=p.device)
                                 state["m_block_size"] = m_block_size
                             if beta3 is not None:
-                                state["conf_row_q"] = torch.zeros_like(state["row_var_q"])
-                                state["conf_row_scale"] = torch.ones_like(state["row_var_scale"])
+                                state["conf_row_q"] = torch.full_like(state["row_var_q"], 255)
+                                state["conf_row_scale"] = torch.full_like(state["row_var_scale"], 126.0)
                                 state["conf_row_shape"] = state["row_var_shape"]
                                 state["conf_row_pad"] = state["row_var_pad"]
-                                state["conf_col_q"] = torch.zeros_like(state["col_var_q"])
-                                state["conf_col_scale"] = torch.ones_like(state["col_var_scale"])
+                                state["conf_col_q"] = torch.full_like(state["col_var_q"], 255)
+                                state["conf_col_scale"] = torch.full_like(state["col_var_scale"], 126.0)
                                 state["conf_col_shape"] = state["col_var_shape"]
                                 state["conf_col_pad"] = state["col_var_pad"]
                         else:
                             state["row_var"] = torch.zeros(r_shape, dtype=torch.float32, device=p.device)
                             state["col_var"] = torch.zeros(c_shape, dtype=torch.float32, device=p.device)
@@ -494,7 +507,7 @@ class Adafactor8Bit(Optimizer):
                             if beta1 is not None:
                                 m_padded_numel = ((p.numel() + m_block_size - 1) // m_block_size) * m_block_size
-                                state["m_q"] = torch.full((m_padded_numel // 2,), 0x88, dtype=torch.uint8, device=p.device)
+                                state["m_q"] = torch.full((m_padded_numel // 2,), 0x77, dtype=torch.uint8, device=p.device)
                                 state["m_scale"] = torch.ones(m_padded_numel // m_block_size, dtype=torch.float32, device=p.device)
                                 state["m_block_size"] = m_block_size
                         else:
@@ -566,7 +579,7 @@ class Adafactor8Bit(Optimizer):
                             state["m_q"], state["m_scale"] = _quantize_4bit_pytorch(state["m"], m_curr_block_size)
                             state.pop("m")
                         else:
-                            state["m_q"] = torch.full((m_padded_numel // 2,), 0x88, dtype=torch.uint8, device=p.device)
+                            state["m_q"] = torch.full((m_padded_numel // 2,), 0x77, dtype=torch.uint8, device=p.device)
                             state["m_scale"] = torch.ones(m_padded_numel // m_curr_block_size, dtype=torch.float32, device=p.device)
                         state["m_block_size"] = m_curr_block_size
@@ -620,7 +633,7 @@ class Adafactor8Bit(Optimizer):
                         state["m_q"], state["m_scale"] = _quantize_4bit_pytorch(state["m"], m_curr_block_size)
                         state.pop("m")
                     else:
-                        state["m_q"] = torch.full((m_padded_numel // 2,), 0x88, dtype=torch.uint8, device=p.device)
+                        state["m_q"] = torch.full((m_padded_numel // 2,), 0x77, dtype=torch.uint8, device=p.device)
                         state["m_scale"] = torch.ones(m_padded_numel // m_curr_block_size, dtype=torch.float32, device=p.device)
                     state["m_block_size"] = m_curr_block_size
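
The changed fill values in these initialization hunks (`0x88` → `0x77` for `m_q`, and `255` with scale `126.0` for the confidence state) can be decoded by hand. The sketch below does so in pure Python; the function names are hypothetical, and it assumes the confidence state is decoded with the same log-space formula the kernels use for the variance state (`q * INV_255 * scale + MIN_LOG`), which the diff does not show directly.

```python
MIN_LOG = -126.0

NF4_QMAP = [
    -1.0, -0.6961928, -0.52507306, -0.3895074,
    -0.27408478, -0.17286907, -0.07958022, 0.0,
    0.07958022, 0.17286907, 0.27408478, 0.3895074,
    0.52507306, 0.6961928, 0.8641379, 1.0,
]


def decode_m_byte_nf4(byte, scale=1.0):
    """0.2.5 layout: each nibble is an index into the NF4 table."""
    return NF4_QMAP[byte >> 4] * scale, NF4_QMAP[byte & 0x0F] * scale


def decode_m_byte_uniform(byte, scale=1.0):
    """0.2.2 layout: each nibble is a signed integer stored with a +8 bias."""
    return ((byte >> 4) - 8) * scale, ((byte & 0x0F) - 8) * scale


def decode_log_state(q, scale):
    """Assumed log-space decode: 8-bit code -> log2 value -> linear value."""
    return 2.0 ** (q / 255.0 * scale + MIN_LOG)


new_zero = decode_m_byte_nf4(0x77)        # both nibbles index the 0.0 entry
old_zero = decode_m_byte_uniform(0x88)    # 8 - 8 = 0 under the old bias
stale_fill = decode_m_byte_nf4(0x88)      # old fill read with the new table
conf_new = decode_log_state(255, 126.0)   # 0.2.5 confidence init
conf_old = decode_log_state(0, 1.0)       # 0.2.2 confidence init
```

Read this way, `0x77` is the byte that represents two zero momenta under the table layout, whereas the old `0x88` fill would decode to a non-zero value. Under the stated assumption, the new confidence fill decodes to 1.0 instead of the smallest representable value.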
@@ -697,6 +710,11 @@ class Adafactor8Bit(Optimizer):
 # ==========================================
 def _apply_fira_cuda(state: Dict[str, Any], total_sum_sq: Tensor, alpha: Tensor, fira_margin: float) -> Tuple[Tensor, Tensor]:
     current_norm = total_sum_sq.sqrt().view([])
+    is_finite = torch.isfinite(current_norm)
+    current_norm = torch.where(is_finite, current_norm, torch.zeros_like(current_norm))
+    total_sum_sq = torch.where(is_finite, total_sum_sq, torch.zeros_like(total_sum_sq))
     fira_threshold = 1.0 + fira_margin
     prev_norm = state.get("fira_prev_norm", None)
@@ -704,13 +722,14 @@ def _apply_fira_cuda(state: Dict[str, Any], total_sum_sq: Tensor, alpha: Tensor,
         if not isinstance(prev_norm, Tensor):
             prev_norm = torch.tensor(prev_norm, device=total_sum_sq.device, dtype=torch.float32)
+        is_reset = prev_norm < 1e-6
         ratio = current_norm / (prev_norm + 1e-8)
         limiter = torch.clamp_min(ratio, fira_threshold) / fira_threshold
-        final_scale = 1.0 / limiter
+        final_scale = torch.where(is_reset, torch.ones_like(current_norm), 1.0 / limiter)
+        state["fira_prev_norm"] = torch.where(is_reset, current_norm, current_norm * final_scale)
     else:
         final_scale = torch.tensor(1.0, device=total_sum_sq.device, dtype=torch.float32)
-    state["fira_prev_norm"] = current_norm * final_scale
+        state["fira_prev_norm"] = current_norm
     alpha_scaled = alpha * final_scale
     total_sum_sq.mul_(final_scale.square())
@@ -719,19 +738,26 @@ def _apply_fira_cuda(state: Dict[str, Any], total_sum_sq: Tensor, alpha: Tensor,
 def _apply_fira_pytorch(state: Dict[str, Any], update: Tensor, fira_margin: float, numel: int, d: float) -> Tuple[Tensor, Tensor]:
     current_norm = torch.linalg.vector_norm(update)
+    is_finite = torch.isfinite(current_norm)
+    current_norm = torch.where(is_finite, current_norm, torch.zeros_like(current_norm))
+    update = torch.where(is_finite, update, torch.zeros_like(update))
     fira_threshold = 1.0 + fira_margin
     prev_norm = state.get("fira_prev_norm", None)
     if prev_norm is not None:
         if not isinstance(prev_norm, Tensor):
             prev_norm = torch.tensor(prev_norm, device=update.device, dtype=torch.float32)
+        is_reset = prev_norm < 1e-6
         ratio = current_norm / (prev_norm + 1e-8)
         limiter = torch.clamp_min(ratio, fira_threshold) / fira_threshold
-        final_scale = 1.0 / limiter
+        final_scale = torch.where(is_reset, torch.ones_like(current_norm), 1.0 / limiter)
+        state["fira_prev_norm"] = torch.where(is_reset, current_norm, current_norm * final_scale)
     else:
         final_scale = torch.tensor(1.0, device=update.device, dtype=torch.float32)
-    state["fira_prev_norm"] = current_norm * final_scale
+        state["fira_prev_norm"] = current_norm
     update_scaled = update * final_scale
     norm_final = current_norm * final_scale
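
Review note: the `is_finite` and `is_reset` guards work together. The scalar sketch below restates the limiter on plain Python floats instead of tensors, so it is an illustration of the logic in this hunk rather than the package's code. The function names `fira_scale_new` and `fira_scale_old` are hypothetical. It shows the case the reset guard appears to address: once the stored previous norm is (near) zero, the 0.2.2 logic scales the next update down to almost nothing.

```python
import math
from typing import Optional, Tuple


def fira_scale_old(current_norm: float, prev_norm: Optional[float],
                   fira_margin: float = 0.01) -> Tuple[float, float]:
    """0.2.2 behaviour: returns (final_scale, new_prev_norm)."""
    if prev_norm is None:
        return 1.0, current_norm
    threshold = 1.0 + fira_margin
    ratio = current_norm / (prev_norm + 1e-8)
    limiter = max(ratio, threshold) / threshold
    final_scale = 1.0 / limiter
    return final_scale, current_norm * final_scale


def fira_scale_new(current_norm: float, prev_norm: Optional[float],
                   fira_margin: float = 0.01) -> Tuple[float, float]:
    """0.2.5 behaviour: non-finite norms become 0, tiny prev_norm triggers a reset."""
    if not math.isfinite(current_norm):
        current_norm = 0.0
    if prev_norm is None or prev_norm < 1e-6:
        # First step or reset: apply the update unscaled and restart tracking.
        return 1.0, current_norm
    threshold = 1.0 + fira_margin
    ratio = current_norm / (prev_norm + 1e-8)
    limiter = max(ratio, threshold) / threshold
    final_scale = 1.0 / limiter
    return final_scale, current_norm * final_scale


# A non-finite step zeroes the tracked norm ...
scale_nan, prev_after_nan = fira_scale_new(float("nan"), 1.0)
# ... and the following healthy step is passed through instead of being crushed.
scale_new, prev_new = fira_scale_new(2.0, prev_after_nan)
scale_old, prev_old = fira_scale_old(2.0, 0.0)
```

With the old logic, a zero previous norm yields a scale on the order of 1e-8, and the tracked norm can then only grow by a factor of `1 + fira_margin` per step. The reset branch avoids that slow recovery.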
@@ -1295,6 +1321,7 @@ def _update_param_apollo(
     fira_margin: float = 0.01,
 ):
     grad_work = grad.neg().float() if maximize else grad.float()
+    grad_work = torch.where(torch.isfinite(grad_work), grad_work, torch.zeros_like(grad_work))
     update_low = None
     if apollo_factorize:
@@ -1401,7 +1428,7 @@ def _update_param_apollo(
                 if state.get("m_low_q") is None:
                     m_padded_numel = ((grad_low_numel + m_curr_block_size - 1) // m_curr_block_size) * m_curr_block_size
-                    state["m_low_q"] = torch.full((m_padded_numel // 2,), 0x88, dtype=torch.uint8, device=grad_low.device)
+                    state["m_low_q"] = torch.full((m_padded_numel // 2,), 0x77, dtype=torch.uint8, device=grad_low.device)
                     state["m_low_scale"] = torch.ones(m_padded_numel // m_curr_block_size, dtype=torch.float32, device=grad_low.device)
                     state["m_block_size"] = m_curr_block_size
@@ -1481,7 +1508,7 @@ def _update_param_apollo(
                 if state.get("m_low_q") is None:
                     m_padded_numel = ((grad_low_numel + m_curr_block_size - 1) // m_curr_block_size) * m_curr_block_size
-                    state["m_low_q"] = torch.full((m_padded_numel // 2,), 0x88, dtype=torch.uint8, device=grad_low.device)
+                    state["m_low_q"] = torch.full((m_padded_numel // 2,), 0x77, dtype=torch.uint8, device=grad_low.device)
                     state["m_low_scale"] = torch.ones(m_padded_numel // m_curr_block_size, dtype=torch.float32, device=grad_low.device)
                     state["m_block_size"] = m_curr_block_size
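
Review note: the buffer sizes in these hunks follow from rounding the element count up to a whole number of blocks. A small worked sketch is given below; the helper name `m_state_shapes` and the example sizes are hypothetical, and it assumes an even block size so that two 4-bit codes fill each byte.

```python
from typing import Tuple


def m_state_shapes(numel: int, block_size: int) -> Tuple[int, int, int]:
    """Return (padded_numel, packed_bytes, num_scales) for a 4-bit packed buffer."""
    # Round up to the next multiple of block_size, as in the diff.
    padded = ((numel + block_size - 1) // block_size) * block_size
    packed_bytes = padded // 2          # two 4-bit codes per uint8
    num_scales = padded // block_size   # one float32 scale per block
    return padded, packed_bytes, num_scales


# Illustrative sizes only: 1000 elements with a block size of 128.
padded, packed_bytes, num_scales = m_state_shapes(1000, 128)
```

For 1000 elements and a block size of 128, this gives 1024 padded elements, 512 packed bytes, and 8 scales.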
@@ -1575,16 +1602,20 @@ def _update_param_apollo(
     else:
         current_norm_t = torch.linalg.vector_norm(grad_work, ord=2, dtype=torch.float32) * scaling_factor
+    is_finite = torch.isfinite(current_norm_t)
+    current_norm_t = torch.where(is_finite, current_norm_t, torch.zeros_like(current_norm_t))
     fira_threshold = 1.0 + fira_margin
     if "scaled_grad_norm_prev" in state:
         prev_norm_t = state["scaled_grad_norm_prev"]
         if not isinstance(prev_norm_t, Tensor):
             prev_norm_t = torch.tensor(prev_norm_t, device=param_work.device, dtype=torch.float32)
+        is_reset = prev_norm_t < 1e-6
         ratio = current_norm_t / (prev_norm_t + 1e-8)
         limiter = torch.clamp_min(ratio, fira_threshold) / fira_threshold
-        final_scale = scaling_factor / limiter
-        state["scaled_grad_norm_prev"] = current_norm_t / limiter
+        final_scale = torch.where(is_reset, scaling_factor, scaling_factor / limiter)
+        state["scaled_grad_norm_prev"] = torch.where(is_reset, current_norm_t, current_norm_t / limiter)
     else:
         final_scale = scaling_factor
         state["scaled_grad_norm_prev"] = current_norm_t

{adafactor8bit-0.2.2 → adafactor8bit-0.2.5/adafactor8bit.egg-info}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: adafactor8bit
-Version: 0.2.2
+Version: 0.2.5
 Summary: 8-bit Adafactor Optimizer with Fused CUDA Kernels
 Home-page: https://github.com/yanfeiwong/adafactor-8bit
 Author: WANG YAN
@@ -53,7 +53,7 @@ An enhanced 8-bit Adafactor optimizer featuring fused CUDA kernels, log-space bl
 - **Log-Space Quantization**: Maps the second moment (variance) to the log2 space before 8-bit quantization. This approach accommodates the long-tail distribution of variances, reducing the risk of small second-moment estimates being truncated to zero and improving overall training stability.
 - **Fused CUDA Kernels**: Combines dequantization, EMA updates, Warp-Shuffle reductions, and requantization into single kernels. It utilizes `float4` vectorization to optimize memory bandwidth usage.
-- **Optional 4-bit Packed First Moment**: Stores the first moment (`beta1`) in a physically packed 4-bit format when enabled, providing momentum with minimal additional memory overhead.
+- **Optional NF4 First Moment**: Stores the optional first moment (`beta1`) using Normal Float 4-bit (NF4) non-uniform quantization, preserving small momentum updates while keeping memory overhead minimal.
 - **CAME Confidence Guidance**: Optional Confidence-guided Adaptive Memory Efficient Optimization (CAME) that estimates update confidence from historical momentum and adaptively suppresses unstable update directions, improving training stability and reducing loss spikes.
 - **APOLLO Subspace Projection**: Opt-in random subspace projection that estimates adaptive gradient scaling in a low-rank space, preventing stale second-moment statistics and potentially improving convergence and generalization.
 - **Fira Norm-Growth Limiter**: Suppresses destructive gradient spikes by regulating the relative increase of update norms. Originally used for the APOLLO path, it is now available for the standard Adafactor path as well. It improves training stability and often allows the safe removal of external gradient clipping.
@@ -278,7 +278,7 @@ Enable the APOLLO path to compute gradient scaling factors in a memory-efficient
 - **`apollo_factorize` (Experimental)**: Applies Adafactor's row/column factorization within the low-rank subspace. Mathematically, this leverages the norm-preserving property of random projections to approximate the variance of the primary dimension, while the secondary dimension's variance is estimated across random bases, introducing inherent noise. This dual-compression mechanism drastically reduces optimizer state overhead. Note that for smaller models, the actual VRAM savings might be marginal, and the introduced noise could impact convergence stability. Use with caution.
 - **Fira Limiter Integration**: The APOLLO path automatically applies the Fira Norm-Growth Limiter to the scaled gradients to prevent sudden gradient rises from causing loss spikes. You can adjust its sensitivity using the global `fira_margin` parameter.
-## 🛡️ CAME Confidence-Guided Updates
+## 🧊 CAME Confidence-Guided Updates
 Enable the CAME (Confidence-guided Adaptive Memory Efficient Optimization) path to add a confidence estimation stage after momentum accumulation:
@@ -326,24 +326,25 @@ If you are migrating from optimizers like AdamW, Adafactor's learning rate behav
 *These are safe starting points. Always validate on your own task and batch size.*
 ## 🎓 Acknowledgements
-Thanks to **Noam Shazeer** and **Mitchell Stern** for proposing the original Adafactor algorithm in the paper [Adafactor: Adaptive Learning Rates with Sublinear Memory Cost](https://arxiv.org/abs/1804.04235).
-Thanks to **Tim Dettmers** for the inspiration from the paper [8-BIT OPTIMIZERS VIA BLOCK-WISE QUANTIZATION](https://arxiv.org/abs/2110.02861) and the [bitsandbytes](https://github.com/bitsandbytes-foundation/bitsandbytes) library.
-Thanks to **Hanqing Zhu**, **Zhenyu Zhang**, and the team for proposing the approximated gradient scaling method in the paper [APOLLO: SGD-Like Memory, AdamW-level Performance](https://arxiv.org/abs/2412.05270).
+This project builds upon the foundational work of several researchers and open-source communities. Sincere thanks to the following for their invaluable contributions:
-Thanks to **Xi Chen**, **Kaituo Feng**, and the team for the Norm-Growth Limiter mechanism introduced in [Fira: Can We Achieve Full-rank Training of LLMs Under Low-rank Constraint?](https://arxiv.org/abs/2410.01623).
+### Core Algorithm & Optimizer Design
+- **Noam Shazeer & Mitchell Stern** for proposing the original **Adafactor** algorithm ([Adafactor: Adaptive Learning Rates with Sublinear Memory Cost](https://arxiv.org/abs/1804.04235)).
+- **Tim Dettmers** for the inspiration from **8-bit block-wise quantization** ([8-BIT OPTIMIZERS VIA BLOCK-WISE QUANTIZATION](https://arxiv.org/abs/2110.02861)) and the [bitsandbytes](https://github.com/bitsandbytes-foundation/bitsandbytes) library.
+- **Hanqing Zhu, Zhenyu Zhang, et al.** for the **APOLLO** algorithm ([APOLLO: SGD-Like Memory, AdamW-level Performance](https://arxiv.org/abs/2412.05270)).
+- **Xi Chen, Kaituo Feng, et al.** for the **Norm-Growth Limiter** mechanism in **Fira** ([Fira: Can We Achieve Full-rank Training of LLMs Under Low-rank Constraint?](https://arxiv.org/abs/2410.01623)).
+- **Yang Luo, et al.** for the **confidence-guided strategy** in **CAME** ([CAME: Confidence-guided Adaptive Memory Efficient Optimization](https://arxiv.org/abs/2307.02047)).
-Thanks to **Yang Luo** and the team for proposing the confidence-guided strategy in the paper [CAME: Confidence-guided Adaptive Memory Efficient Optimization](https://arxiv.org/abs/2307.02047).
+### Quantization & Implementation
+- **The QLoRA Team** for pioneering the **4-bit NormalFloat (NF4)** quantization format ([QLoRA: Efficient Finetuning of Quantized LLMs](https://arxiv.org/abs/2305.14314)) that inspired our first moment quantization.
+- **The PyTorch AO Team** for their work on [4-bit optimizer states](https://github.com/pytorch/ao/tree/main/torchao/optim), validating distribution-aware quantization for optimizer moments.
+- **The PyTorch Team** for providing the foundational optimizer implementation and the C++ Extension toolchain.
-Thanks to the **PyTorch team** for providing the foundational Optimizer implementation and the C++ Extension toolchain.
+### Technical Review & Discussion
+- **Qwen, ChatGLM, and DeepSeek** (large language models) for valuable technical discussions and code reviews on CUDA low-level optimization, memory safety mechanisms, and cross-platform compilation pipeline design.
-Thanks to the large language models **Qwen**, **ChatGLM** and **DeepSeek** for valuable technical discussions and code reviews on CUDA low-level optimization and memory safety mechanisms.
 ## 🏛️ License

{adafactor8bit-0.2.2 → adafactor8bit-0.2.5}/setup.py RENAMED

@@ -9,7 +9,7 @@ long_description = (this_directory / "README.md").read_text(encoding="utf-8")
 setup(
     name="adafactor8bit",
-    version="0.2.2",
+    version="0.2.5",
     description="8-bit Adafactor Optimizer with Fused CUDA Kernels",
     author="WANG YAN",
     author_email="yanfeiwong1997@outlook.com",