numkong 7.4.3 → 7.4.4
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +49 -49
- package/binding.gyp +3 -0
- package/include/numkong/capabilities.h +1 -1
- package/include/numkong/each/haswell.h +4 -4
- package/include/numkong/types.h +15 -9
- package/numkong.gypi +3 -0
- package/package.json +7 -7
package/README.md
CHANGED
|
@@ -391,24 +391,24 @@ Float16 prioritizes __precision over range__ (10 vs 7 mantissa bits), making it
|
|
|
391
391
|
On x86, older CPUs use __F16C extensions__ (Ivy Bridge+) for fast Float16 → Float32 conversion; Sapphire Rapids+ adds native __AVX-512-FP16__ with dedicated Float16 arithmetic.
|
|
392
392
|
On Arm, ARMv8.4-A adds __FMLAL/FMLAL2__ instructions for fused Float16 → Float32 widening multiply-accumulate, reducing the total latency from 7 cycles to 4 cycles and achieving 20–48% speedup over the separate convert-then-FMA path.
|
|
393
393
|
|
|
394
|
-
| Platform
|
|
395
|
-
|
|
|
396
|
-
| __x86__
|
|
397
|
-
| Diamond
|
|
398
|
-
| Sapphire
|
|
399
|
-
| Genoa
|
|
400
|
-
| Skylake
|
|
401
|
-
| Haswell
|
|
402
|
-
| __Arm__
|
|
403
|
-
| Apple M2
|
|
404
|
-
| Graviton 3
|
|
405
|
-
| Apple M1
|
|
406
|
-
| Graviton 2
|
|
407
|
-
| Graviton 1
|
|
408
|
-
| __RISC-V__
|
|
409
|
-
| RVV
|
|
410
|
-
| RVV
|
|
411
|
-
| RVV
|
|
394
|
+
| Platform | BFloat16 Path | Step | Float16 Path | Step |
|
|
395
|
+
| :--------------- | :------------------------- | ---: | :--------------------- | ---: |
|
|
396
|
+
| __x86__ | | | | |
|
|
397
|
+
| Diamond, '26 | ↓ Genoa | 32 | `VDPPHPS` widening dot | 32 |
|
|
398
|
+
| Sapphire, '23 | ↓ Genoa | 32 | ↓ Skylake | 16 |
|
|
399
|
+
| Genoa, '22 | `VDPBF16PS` widening dot | 32 | ↓ Skylake | 16 |
|
|
400
|
+
| Skylake, '15 | `SLLI` + `VFMADD` | 16 | `VCVTPH2PS` + `VFMADD` | 16 |
|
|
401
|
+
| Haswell, '13 | `SLLI` + `VFMADD` | 8 | `VCVTPH2PS` + `VFMADD` | 8 |
|
|
402
|
+
| __Arm__ | | | | |
|
|
403
|
+
| Apple M2+, '22 | `BFDOT` widening dot | 8 | ↓ FP16FML | 8 |
|
|
404
|
+
| Graviton 3+, '21 | `SVBFDOT` widening dot | 4–32 | `SVCVT` → `SVFMLA` | 4–32 |
|
|
405
|
+
| Apple M1, '20 | ↓ NEON | 8 | `FMLAL` widening FMA | 8 |
|
|
406
|
+
| Graviton 2, '19 | ↓ NEON | 8 | `FCVTL` + `FMLA` | 4 |
|
|
407
|
+
| Graviton 1, '18 | `SHLL` + `FMLA` | 8 | bit-manip → `FMLA` | 8 |
|
|
408
|
+
| __RISC-V__ | | | | |
|
|
409
|
+
| RVV+Zvfbfwma | `VFWMACCBF16` widening FMA | 4–32 | ↓ RVV | 4–32 |
|
|
410
|
+
| RVV+Zvfh | ↓ RVV | 4–32 | `VFWMACC` widening FMA | 4–32 |
|
|
411
|
+
| RVV | shift + `VFMACC` | 4–32 | convert + `VFMACC` | 4–32 |
|
|
412
412
|
|
|
413
413
|
> BFloat16 shares Float32's 8-bit exponent, so upcasting is a 16-bit left shift (`SLLI` on x86, `SHLL` on Arm) that zero-pads the truncated mantissa — essentially free.
|
|
414
414
|
> Float16 has a different exponent width (5 vs 8 bits), requiring a dedicated convert: `VCVTPH2PS` (x86 F16C) or `FCVTL` (Arm NEON).
|
|
@@ -444,22 +444,22 @@ E4M3FN (no infinities, NaN only) is preferred for __training__ where precision n
|
|
|
444
444
|
On x86 Genoa/Sapphire Rapids, E4M3/E5M2 values upcast to BFloat16 via lookup tables, then use native __DPBF16PS__ for 2-per-lane dot products accumulating to Float32.
|
|
445
445
|
On Arm Graviton 3+, the same BFloat16 upcast happens via NEON table lookups, then __BFDOT__ instructions complete the computation.
|
|
446
446
|
|
|
447
|
-
| Platform
|
|
448
|
-
|
|
|
449
|
-
| __x86__
|
|
450
|
-
| Diamond
|
|
451
|
-
| Genoa
|
|
452
|
-
| Ice Lake
|
|
453
|
-
| Skylake
|
|
454
|
-
| Haswell
|
|
455
|
-
| __Arm__
|
|
456
|
-
| NEON
|
|
457
|
-
| NEON
|
|
458
|
-
| NEON
|
|
459
|
-
| __RISC-V__
|
|
460
|
-
| RVV
|
|
461
|
-
| RVV
|
|
462
|
-
| RVV
|
|
447
|
+
| Platform | E5M2 Path | Step | E4M3 Path | Step |
|
|
448
|
+
| :---------------- | :----------------------------- | ---: | :----------------------------- | ---: |
|
|
449
|
+
| __x86__ | | | | |
|
|
450
|
+
| Diamond, '26 | `VCVTBF82PH` → F16 + `VDPPHPS` | 32 | `VCVTHF82PH` → F16 + `VDPPHPS` | 32 |
|
|
451
|
+
| Genoa, '22 | → BF16 + `VDPBF16PS` | 32 | ↓ Ice Lake | 64 |
|
|
452
|
+
| Ice Lake, '19 | ↓ Skylake | 16 | octave LUT + `VPDPBUSD` | 64 |
|
|
453
|
+
| Skylake, '15 | rebias → F32 FMA | 16 | rebias → F32 FMA | 16 |
|
|
454
|
+
| Haswell, '13 | rebias → F32 FMA | 8 | rebias → F32 FMA | 8 |
|
|
455
|
+
| __Arm__ | | | | |
|
|
456
|
+
| NEON+FP8DOT, '26 | native `FDOT` | 16 | native `FDOT` | 16 |
|
|
457
|
+
| NEON+FP16FML, '20 | SHL → F16 + `FMLAL` | 16 | LUT → F16 + `FMLAL` | 16 |
|
|
458
|
+
| NEON, '18 | SHL + `FCVTL` + FMA | 8 | → F16 + `FCVTL` + FMA | 8 |
|
|
459
|
+
| __RISC-V__ | | | | |
|
|
460
|
+
| RVV+Zvfbfwma | rebias → BF16 + `VFWMACCBF16` | 4–32 | LUT → BF16 + `VFWMACCBF16` | 4–32 |
|
|
461
|
+
| RVV+Zvfh | SHL → F16 + `VFWMACC` | 4–32 | LUT → F16 + `VFWMACC` | 4–32 |
|
|
462
|
+
| RVV | rebias → F32 + `VFMACC` | 4–32 | LUT → F32 + `VFMACC` | 4–32 |
|
|
463
463
|
|
|
464
464
|
> E5M2 shares Float16's exponent bias (15), so E5M2 → Float16 conversion is a single left-shift by 8 bits (`SHL 8`).
|
|
465
465
|
> E4M3 on Ice Lake uses "octave decomposition": the 4-bit exponent splits into 2 octave + 2 remainder bits, yielding 7 integer accumulators post-scaled by powers of 2.
|
|
@@ -469,23 +469,23 @@ Their smaller range allows scaling to exact integers that fit in `i8`/`i16`, ena
|
|
|
469
469
|
Float16 can also serve as an accumulator, accurately representing ~50 products of E3M2FN pairs or ~20 products of E2M3FN pairs before overflow.
|
|
470
470
|
On Arm, NEON FHM extensions bring widening `FMLAL` multiply-accumulates for Float16 — both faster and more widely available than `BFDOT` for BFloat16.
|
|
471
471
|
|
|
472
|
-
| Platform
|
|
473
|
-
|
|
|
474
|
-
| __x86__
|
|
475
|
-
| Sierra Forest
|
|
476
|
-
| Alder Lake
|
|
477
|
-
| Ice Lake
|
|
478
|
-
| Skylake
|
|
479
|
-
| Haswell
|
|
480
|
-
| __Arm__
|
|
481
|
-
| NEON
|
|
482
|
-
| NEON
|
|
483
|
-
| NEON
|
|
484
|
-
| __RISC-V__
|
|
485
|
-
| RVV
|
|
472
|
+
| Platform | E3M2 Path | Step | E2M3 Path | Step |
|
|
473
|
+
| :----------------- | :------------------------ | ---: | :----------------------- | ---: |
|
|
474
|
+
| __x86__ | | | | |
|
|
475
|
+
| Sierra Forest, '24 | ↓ Haswell | 32 | `VPSHUFB` + `VPDPBSSD` | 32 |
|
|
476
|
+
| Alder Lake, '21 | ↓ Haswell | 32 | `VPSHUFB` + `VPDPBUSD` | 32 |
|
|
477
|
+
| Ice Lake, '19 | `VPERMW` + `VPMADDWD` | 32 | `VPERMB` + `VPDPBUSD` | 64 |
|
|
478
|
+
| Skylake, '15 | `VPSHUFB` + `VPMADDWD` | 64 | `VPSHUFB` + `VPMADDUBSW` | 64 |
|
|
479
|
+
| Haswell, '13 | `VPSHUFB` + `VPMADDWD` | 32 | `VPSHUFB` + `VPMADDUBSW` | 32 |
|
|
480
|
+
| __Arm__ | | | | |
|
|
481
|
+
| NEON+FP8DOT, '26 | → E5M2 + `FDOT` | 16 | → E4M3 + `FDOT` | 16 |
|
|
482
|
+
| NEON+DotProd, '19 | `VQTBL2` + `SMLAL` | 16 | `VQTBL2` + `SDOT` | 16 |
|
|
483
|
+
| NEON, '18 | → F16 + `FCVTL` + FMA | 16 | → F16 + `FCVTL` + FMA | 16 |
|
|
484
|
+
| __RISC-V__ | | | | |
|
|
485
|
+
| RVV | I16 gather LUT + `VWMACC` | 4–32 | U8 gather LUT + `VWMACC` | 4–32 |
|
|
486
486
|
|
|
487
487
|
> E3M2/E2M3 values map to exact integers via 32-entry LUTs (magnitudes up to 448 for E3M2, 120 for E2M3), enabling integer accumulation with no rounding error.
|
|
488
|
-
> On NEON
|
|
488
|
+
> On NEON+FP8DOT, E3M2 is first promoted to E5M2 and E2M3 to E4M3 before the hardware `FDOT` instruction.
|
|
489
489
|
> Sierra Forest and Alder Lake use native `VPDPBSSD` (signed×signed) and `VPDPBUSD` (unsigned×signed) respectively for E2M3.
|
|
490
490
|
|
|
491
491
|
E4M3 and E5M2 cannot use the integer path.
|
package/binding.gyp
CHANGED
|
@@ -196,7 +196,7 @@ NK_PUBLIC void nk_each_sum_f16_haswell(nk_f16_t const *a, nk_f16_t const *b, nk_
|
|
|
196
196
|
__m256 a_f32x8 = _mm256_cvtph_ps(a_f16x8);
|
|
197
197
|
__m256 b_f32x8 = _mm256_cvtph_ps(b_f16x8);
|
|
198
198
|
__m256 result_f32x8 = _mm256_add_ps(a_f32x8, b_f32x8);
|
|
199
|
-
__m128i result_f16x8 = _mm256_cvtps_ph(result_f32x8, _MM_FROUND_TO_NEAREST_INT
|
|
199
|
+
__m128i result_f16x8 = _mm256_cvtps_ph(result_f32x8, _MM_FROUND_TO_NEAREST_INT);
|
|
200
200
|
_mm_storeu_si128((__m128i *)(result + i), result_f16x8);
|
|
201
201
|
}
|
|
202
202
|
|
|
@@ -223,7 +223,7 @@ NK_PUBLIC void nk_each_scale_f16_haswell(nk_f16_t const *a, nk_size_t n, nk_f32_
|
|
|
223
223
|
__m128i a_f16x8 = _mm_loadu_si128((__m128i const *)(a + i));
|
|
224
224
|
__m256 a_f32x8 = _mm256_cvtph_ps(a_f16x8);
|
|
225
225
|
__m256 result_f32x8 = _mm256_fmadd_ps(a_f32x8, alpha_f32x8, beta_f32x8);
|
|
226
|
-
__m128i result_f16x8 = _mm256_cvtps_ph(result_f32x8, _MM_FROUND_TO_NEAREST_INT
|
|
226
|
+
__m128i result_f16x8 = _mm256_cvtps_ph(result_f32x8, _MM_FROUND_TO_NEAREST_INT);
|
|
227
227
|
_mm_storeu_si128((__m128i *)(result + i), result_f16x8);
|
|
228
228
|
}
|
|
229
229
|
|
|
@@ -271,7 +271,7 @@ NK_PUBLIC void nk_each_blend_f16_haswell( //
|
|
|
271
271
|
__m256 b_f32x8 = _mm256_cvtph_ps(b_f16x8);
|
|
272
272
|
__m256 a_scaled_f32x8 = _mm256_mul_ps(a_f32x8, alpha_f32x8);
|
|
273
273
|
__m256 result_f32x8 = _mm256_fmadd_ps(b_f32x8, beta_f32x8, a_scaled_f32x8);
|
|
274
|
-
__m128i result_f16x8 = _mm256_cvtps_ph(result_f32x8, _MM_FROUND_TO_NEAREST_INT
|
|
274
|
+
__m128i result_f16x8 = _mm256_cvtps_ph(result_f32x8, _MM_FROUND_TO_NEAREST_INT);
|
|
275
275
|
_mm_storeu_si128((__m128i *)(result + i), result_f16x8);
|
|
276
276
|
}
|
|
277
277
|
|
|
@@ -451,7 +451,7 @@ NK_PUBLIC void nk_each_fma_f16_haswell( //
|
|
|
451
451
|
__m256 ab_f32x8 = _mm256_mul_ps(a_f32x8, b_f32x8);
|
|
452
452
|
__m256 abc_f32x8 = _mm256_mul_ps(ab_f32x8, alpha_f32x8);
|
|
453
453
|
__m256 result_f32x8 = _mm256_fmadd_ps(c_f32x8, beta_f32x8, abc_f32x8);
|
|
454
|
-
__m128i result_f16x8 = _mm256_cvtps_ph(result_f32x8, _MM_FROUND_TO_NEAREST_INT
|
|
454
|
+
__m128i result_f16x8 = _mm256_cvtps_ph(result_f32x8, _MM_FROUND_TO_NEAREST_INT);
|
|
455
455
|
_mm_storeu_si128((__m128i *)(result + i), result_f16x8);
|
|
456
456
|
}
|
|
457
457
|
|
package/include/numkong/types.h
CHANGED
|
@@ -119,6 +119,12 @@
|
|
|
119
119
|
#define NK_MAY_ALIAS_
|
|
120
120
|
#endif
|
|
121
121
|
|
|
122
|
+
#if defined(__has_builtin)
|
|
123
|
+
#define nk_has_builtin_(x) __has_builtin(x)
|
|
124
|
+
#else
|
|
125
|
+
#define nk_has_builtin_(x) 0
|
|
126
|
+
#endif
|
|
127
|
+
|
|
122
128
|
// Allow SIMD kernels to redirect small inputs to serial implementations.
|
|
123
129
|
// Enabled by default for production use. Tests and benchmarks may disable
|
|
124
130
|
// this to isolate SIMD path behavior on small inputs.
|
|
@@ -425,7 +431,7 @@
|
|
|
425
431
|
// AppleClang 17 exposes SME sub-features through `arm_sme.h` builtin aliases,
|
|
426
432
|
// not dedicated `__ARM_FEATURE_*` predefines for every matrix subtype.
|
|
427
433
|
#if !defined(NK_TARGET_SMEF64) || (NK_TARGET_SMEF64 && !NK_TARGET_ARM64_)
|
|
428
|
-
#if defined(__ARM_FEATURE_SME_F64F64) || (
|
|
434
|
+
#if defined(__ARM_FEATURE_SME_F64F64) || nk_has_builtin_(__builtin_sme_svmopa_za64_f64_m)
|
|
429
435
|
#define NK_TARGET_SMEF64 1
|
|
430
436
|
#else
|
|
431
437
|
#undef NK_TARGET_SMEF64
|
|
@@ -434,39 +440,39 @@
|
|
|
434
440
|
#endif // !defined(NK_TARGET_SMEF64) || ...
|
|
435
441
|
|
|
436
442
|
#if !defined(NK_TARGET_SMEBI32) || (NK_TARGET_SMEBI32 && !NK_TARGET_ARM64_)
|
|
437
|
-
#if
|
|
443
|
+
#if nk_has_builtin_(__builtin_sme_svbmopa_za32_u32_m)
|
|
438
444
|
#define NK_TARGET_SMEBI32 1
|
|
439
445
|
#else
|
|
440
446
|
#undef NK_TARGET_SMEBI32
|
|
441
447
|
#define NK_TARGET_SMEBI32 0
|
|
442
|
-
#endif //
|
|
448
|
+
#endif // nk_has_builtin_(__builtin_sme_svbmopa_za32_u32_m)
|
|
443
449
|
#endif // !defined(NK_TARGET_SMEBI32) || ...
|
|
444
450
|
|
|
445
451
|
#if !defined(NK_TARGET_SMEHALF) || (NK_TARGET_SMEHALF && !NK_TARGET_ARM64_)
|
|
446
|
-
#if defined(__ARM_FEATURE_SME_F16F16) || (
|
|
452
|
+
#if defined(__ARM_FEATURE_SME_F16F16) || nk_has_builtin_(__builtin_sme_svmopa_za32_f16_m)
|
|
447
453
|
#define NK_TARGET_SMEHALF 1
|
|
448
454
|
#else
|
|
449
455
|
#undef NK_TARGET_SMEHALF
|
|
450
456
|
#define NK_TARGET_SMEHALF 0
|
|
451
|
-
#endif //
|
|
457
|
+
#endif // nk_has_builtin_(__builtin_sme_svmopa_za32_f16_m)
|
|
452
458
|
#endif // !defined(NK_TARGET_SMEHALF) || ...
|
|
453
459
|
|
|
454
460
|
#if !defined(NK_TARGET_SMEBF16) || (NK_TARGET_SMEBF16 && !NK_TARGET_ARM64_)
|
|
455
|
-
#if
|
|
461
|
+
#if nk_has_builtin_(__builtin_sme_svmopa_za32_bf16_m)
|
|
456
462
|
#define NK_TARGET_SMEBF16 1
|
|
457
463
|
#else
|
|
458
464
|
#undef NK_TARGET_SMEBF16
|
|
459
465
|
#define NK_TARGET_SMEBF16 0
|
|
460
|
-
#endif //
|
|
466
|
+
#endif // nk_has_builtin_(__builtin_sme_svmopa_za32_bf16_m)
|
|
461
467
|
#endif // !defined(NK_TARGET_SMEBF16) || ...
|
|
462
468
|
|
|
463
469
|
#if !defined(NK_TARGET_SMELUT2) || (NK_TARGET_SMELUT2 && !NK_TARGET_ARM64_)
|
|
464
|
-
#if
|
|
470
|
+
#if nk_has_builtin_(__builtin_sme_svluti2_lane_zt_u8)
|
|
465
471
|
#define NK_TARGET_SMELUT2 1
|
|
466
472
|
#else
|
|
467
473
|
#undef NK_TARGET_SMELUT2
|
|
468
474
|
#define NK_TARGET_SMELUT2 0
|
|
469
|
-
#endif //
|
|
475
|
+
#endif // nk_has_builtin_(__builtin_sme_svluti2_lane_zt_u8)
|
|
470
476
|
#endif // !defined(NK_TARGET_SMELUT2) || ...
|
|
471
477
|
|
|
472
478
|
// Compiling for Arm: NK_TARGET_SMEFA64 (FEAT_SME_FA64, full SVE2 in streaming mode)
|
package/numkong.gypi
CHANGED
package/package.json
CHANGED
|
@@ -1,6 +1,6 @@
|
|
|
1
1
|
{
|
|
2
2
|
"name": "numkong",
|
|
3
|
-
"version": "7.4.
|
|
3
|
+
"version": "7.4.4",
|
|
4
4
|
"description": "Portable mixed-precision math, linear-algebra, & retrieval library with 2000+ SIMD kernels for x86, Arm, RISC-V, LoongArch, Power, & WebAssembly",
|
|
5
5
|
"homepage": "https://github.com/ashvardanian/NumKong",
|
|
6
6
|
"author": "Ash Vardanian",
|
|
@@ -98,11 +98,11 @@
|
|
|
98
98
|
"printWidth": 120
|
|
99
99
|
},
|
|
100
100
|
"optionalDependencies": {
|
|
101
|
-
"@numkong/darwin-arm64": "7.4.
|
|
102
|
-
"@numkong/darwin-x64": "7.4.
|
|
103
|
-
"@numkong/linux-arm64": "7.4.
|
|
104
|
-
"@numkong/linux-x64": "7.4.
|
|
105
|
-
"@numkong/win32-arm64": "7.4.
|
|
106
|
-
"@numkong/win32-x64": "7.4.
|
|
101
|
+
"@numkong/darwin-arm64": "7.4.4",
|
|
102
|
+
"@numkong/darwin-x64": "7.4.4",
|
|
103
|
+
"@numkong/linux-arm64": "7.4.4",
|
|
104
|
+
"@numkong/linux-x64": "7.4.4",
|
|
105
|
+
"@numkong/win32-arm64": "7.4.4",
|
|
106
|
+
"@numkong/win32-x64": "7.4.4"
|
|
107
107
|
}
|
|
108
108
|
}
|