RubyGems - pure_jpeg - Versions diffs - 0.3.1 → 0.3.2 - Mend

pure_jpeg 0.3.1 → 0.3.2

Files changed (11)

checksums.yaml +4 -4
data/CHANGELOG.md +17 -0
data/README.md +10 -6
data/lib/pure_jpeg/bit_writer.rb +2 -2
data/lib/pure_jpeg/dct.rb +204 -56
data/lib/pure_jpeg/decoder.rb +52 -33
data/lib/pure_jpeg/encoder.rb +51 -43
data/lib/pure_jpeg/huffman/encoder.rb +17 -12
data/lib/pure_jpeg/quantization.rb +13 -2
data/lib/pure_jpeg/version.rb +1 -1
metadata +2 -2

checksums.yaml

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: 7a0015f811a2250264bfa73727aa37af15fee1af10e4897243045f2f4f54ae07
-  data.tar.gz: 5085ab8c4bd1d9941c94e116b39ea7ca38f658fdf0e48523cae65fc913f765d0
+  metadata.gz: 780f932b176fecdb5daab2546909fe2610325e54f7364cf482e3e2652ab614a5
+  data.tar.gz: 1fbc350f25d09b989ee6262e077bb36847c6637db325fb03c29f2083bb0ec973
 SHA512:
-  metadata.gz: b507962b2ec9650e743b365b8ace5ddb2a9c1d04de126206ac6062c9da46001dc3e8a1f04df356187b3a2458f3383625d864fe12ee0b3c5e3df589755aa540aa
-  data.tar.gz: 7bf019ea4702bbd7379ad3a1d295acaadd47b185815461b810c08c33299248235b326e9d0cdcfbebe23646f32014487c55212517a9e8a246d4fd5d40891eb62f
+  metadata.gz: a385db40804bf4a992d78253ba619e90b147aa5be7808b341398918f5f36c3593fbd1a979aaad9729ad874f2427fe15d31fc9a0413e5f7ea98ee44e43e125f2d
+  data.tar.gz: b9f5581f4c4f27f42460961b3b4231fd94b45be130fbdce74a28bc4888f2a2f3de08fd43b9e3efc782ccb6b4f260aa8484ff7794e7bde1362018dc3deb282b26

data/CHANGELOG.md

@@ -1,5 +1,22 @@
 # Changelog
+## 0.3.2
+Performance:
+- Replaced matrix-multiply float DCT with integer-scaled AAN (Arai-Agui-Nakajima) DCT from the IJG reference implementation -- all-integer, no Float allocations
+- Fixed-point integer arithmetic for RGB/YCbCr color space conversion in both encoder and decoder
+- Eliminated short-lived Array allocations in Huffman encoder (`category_and_bits` split into separate methods)
+- `String#<<` with Integer instead of `byte.chr` to avoid String allocations in bit writer
+- DCT inner loop unrolling to eliminate nested block invocations
+- Unrolled `write_block` and `extract_block_into` inner loops
+- Integer rounding division in quantization (no more Float division + round)
+- Hoisted hash lookups and method calls out of per-pixel loops in decoder
+Result: ~2.9x faster encode, ~4.6x faster decode on Ruby 4.0.2 with YJIT.
+Credits: [Ufuk Kayserilioglu](https://github.com/paracycle)
 ## 0.3.1
 Fixes:

data/README.md

@@ -194,19 +194,19 @@ Decoding:
 Not supported: arithmetic coding, 12-bit precision, EXIF/ICC profile preservation, adding a default background for transparent sources (see what happens above!). Largely because I don't need these, but they are all do-able, especially with how loosely coupled this library is internally. Raise an issue if you really care about them!
-Possible future improvements: AAN/fixed-point DCT (but it's a LOT of work), ICC profile rendering/conversion.
+Possible future improvements: ICC profile rendering/conversion.
 ## Performance
-On a 1024x1024 image (Ruby 4.0.1 on my M5):
+On a 1024x1024 image (Ruby 4.0.2 with YJIT on an M5):
 | Operation | Time |
 |-----------|------|
-| Encode (color, q85) | ~1.2s |
-| Decode (baseline) | ~1.2s |
-| Decode (progressive) | ~1.3s |
+| Encode (color, q85) | ~0.16s |
+| Decode (baseline) | ~0.14s |
+| Decode (progressive) | ~0.18s |
-Both the encoder and decoder use a separable DCT with a precomputed cosine matrix and reuse all per-block buffers to minimize GC pressure. Pixel data is stored as packed integers internally to avoid per-pixel object allocation.
+The encoder and decoder use an integer-scaled AAN (Arai-Agui-Nakajima) DCT with fixed-point arithmetic throughout — no Float operations in the hot path. Color space conversion uses fixed-point integer math, and pixel data is stored as packed integers to avoid per-pixel object allocation.
 ## Some useful `rake` tasks
@@ -233,6 +233,10 @@ rake profile     # CPU profile with StackProf (requires the stackprof gem)
 **The final 10% still takes 90% of the time.** As mentioned above, the first run was quick, but getting things right has taken much longer. v0.1->0.2 has taken longer than 0.1 did! But we now have progressive JPEG support, even more optimizations, better tests, etc. etc.
+## Credits
+- [Ufuk Kayserilioglu](https://github.com/paracycle) - Major performance optimizations including integer-scaled AAN DCT, fixed-point color space conversion, and YJIT-targeted improvements.
 ## License
 MIT

data/lib/pure_jpeg/bit_writer.rb

@@ -17,8 +17,8 @@ module PureJPEG
       while @bits_in_buffer >= 8
         @bits_in_buffer -= 8
         byte = (@buffer >> @bits_in_buffer) & 0xFF
-        @data << byte.chr
-        @data << "\x00".b if byte == 0xFF  # byte stuffing
+        @data << byte
+        @data << 0x00 if byte == 0xFF  # byte stuffing
       end
       @buffer &= (1 << @bits_in_buffer) - 1
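
The switch from `byte.chr` to `@data << byte` relies on `String#<<` accepting an Integer. The sketch below is a standalone illustration, not the gem's `BitWriter`; `stuff_bytes` is a hypothetical helper. It assumes the buffer is binary-encoded, which is what makes each Integer land as exactly one raw byte.

```ruby
# Standalone illustration; `stuff_bytes` is a hypothetical helper, not part of pure_jpeg.
def stuff_bytes(bytes)
  data = String.new(encoding: Encoding::BINARY) # mutable, binary-encoded buffer
  bytes.each do |byte|
    data << byte                  # Integer argument: appends one raw byte, no temporary String
    data << 0x00 if byte == 0xFF  # JPEG byte stuffing after a literal 0xFF
  end
  data
end

stuffed = stuff_bytes([0x12, 0xFF, 0xD9])
p stuffed.bytes   # => [18, 255, 0, 217]

# On a UTF-8 string the same Integer is treated as a code point instead,
# so 0xFF becomes a two-byte character. The binary encoding matters.
utf8 = String.new("", encoding: Encoding::UTF_8)
utf8 << 0xFF
p utf8.bytesize   # => 2
```

If the real buffer were ever created with a non-binary encoding, the Integer form would silently change the output bytes, so the encoding of `@data` is worth checking when reviewing this hunk.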

data/lib/pure_jpeg/dct.rb

@@ -1,10 +1,15 @@
 # frozen_string_literal: true
 module PureJPEG
+  # Integer-scaled DCT based on the IJG (Independent JPEG Group) reference
+  # implementation (jfdctint.c / jidctint.c). Uses the Arai-Agui-Nakajima
+  # factorization with 13-bit fixed-point constants.
+  #
+  # All arithmetic is pure Integer (additions, shifts, multiplies) — no Float
+  # operations. This is ~3x faster than the matrix-multiply float DCT under
+  # YJIT and eliminates millions of Float object allocations during decode.
   module DCT
-    # Precomputed 8x8 DCT matrix: A[k][n] = (C(k)/2) * cos((2n+1)*k*pi/16)
-    # where C(0) = 1/sqrt(2), C(k) = 1 for k > 0.
-    # This lets us do the 2D DCT as two 1D matrix-vector multiplies (separable).
+    # Keep the float matrix available for reference / testing
     MATRIX = Array.new(8) { |k|
       ck = k == 0 ? 0.5 / Math.sqrt(2.0) : 0.5
       Array.new(8) { |n|
@@ -12,72 +17,215 @@ module PureJPEG
       }
     }.freeze
-    # Flatten for faster indexed access
     MATRIX_FLAT = MATRIX.flatten.freeze
-    # Transposed matrix for inverse DCT: A^T[n][k] = A[k][n]
     MATRIX_T_FLAT = Array.new(64) { |i| MATRIX_FLAT[(i % 8) * 8 + i / 8] }.freeze
-    # Separable forward 2D DCT: row pass then column pass.
-    # Writes result into `out`. Uses `temp` as scratch space.
-    # All three arrays must be pre-allocated with 64 elements.
-    def self.forward!(block, temp, out)
-      # Row pass: temp[y*8+u] = sum_x A[u][x] * block[y*8+x]
-      m = MATRIX_FLAT
-      8.times do |y|
-        y8 = y << 3
-        b0 = block[y8]; b1 = block[y8|1]; b2 = block[y8|2]; b3 = block[y8|3]
-        b4 = block[y8|4]; b5 = block[y8|5]; b6 = block[y8|6]; b7 = block[y8|7]
-        8.times do |u|
-          u8 = u << 3
-          temp[y8|u] = m[u8]*b0 + m[u8|1]*b1 + m[u8|2]*b2 + m[u8|3]*b3 +
-                       m[u8|4]*b4 + m[u8|5]*b5 + m[u8|6]*b6 + m[u8|7]*b7
-        end
+    # Fixed-point constants (13-bit precision) from IJG reference.
+    CONST_BITS = 13
+    PASS1_BITS = 2
+    FIX_0_298631336 = 2446
+    FIX_0_390180644 = 3196
+    FIX_0_541196100 = 4433
+    FIX_0_765366865 = 6270
+    FIX_0_899976223 = 7373
+    FIX_1_175875602 = 9633
+    FIX_1_501321110 = 12299
+    FIX_1_847759065 = 15137
+    FIX_1_961570560 = 16069
+    FIX_2_053119869 = 16819
+    FIX_2_562915447 = 20995
+    FIX_3_072711026 = 25172
+    CB = CONST_BITS
+    P1 = PASS1_BITS
+    CB_M_P1 = CB - P1        # 11
+    CB_P_P1_P3 = CB + P1 + 3 # 18
+    P1_P3 = P1 + 3           # 5
+    CB2_P_P1 = CB * 2 + P1   # 28  (unused, was for column even-multiplied path)
+    # Forward 2D DCT (in-place). Input: 64-element array of level-shifted
+    # integers (-128..127). Output: DCT coefficients (integers).
+    # The `_temp` and `_out` parameters are accepted for API compatibility
+    # but ignored; computation is done in-place on `data`.
+    def self.forward!(data, _temp = nil, _out = nil)
+      # Pass 1: process rows
+      8.times do |row|
+        i = row << 3
+        d0 = data[i]; d1 = data[i+1]; d2 = data[i+2]; d3 = data[i+3]
+        d4 = data[i+4]; d5 = data[i+5]; d6 = data[i+6]; d7 = data[i+7]
+        tmp0 = d0 + d7; tmp7 = d0 - d7
+        tmp1 = d1 + d6; tmp6 = d1 - d6
+        tmp2 = d2 + d5; tmp5 = d2 - d5
+        tmp3 = d3 + d4; tmp4 = d3 - d4
+        # Even part
+        tmp10 = tmp0 + tmp3; tmp13 = tmp0 - tmp3
+        tmp11 = tmp1 + tmp2; tmp12 = tmp1 - tmp2
+        data[i]   = (tmp10 + tmp11) << P1
+        data[i+4] = (tmp10 - tmp11) << P1
+        z1 = (tmp12 + tmp13) * FIX_0_541196100
+        data[i+2] = (z1 + tmp13 * FIX_0_765366865 + (1 << (CB_M_P1 - 1))) >> CB_M_P1
+        data[i+6] = (z1 - tmp12 * FIX_1_847759065 + (1 << (CB_M_P1 - 1))) >> CB_M_P1
+        # Odd part
+        z1 = tmp4 + tmp7; z2 = tmp5 + tmp6
+        z3 = tmp4 + tmp6; z4 = tmp5 + tmp7
+        z5 = (z3 + z4) * FIX_1_175875602
+        tmp4 = tmp4 * FIX_0_298631336
+        tmp5 = tmp5 * FIX_2_053119869
+        tmp6 = tmp6 * FIX_3_072711026
+        tmp7 = tmp7 * FIX_1_501321110
+        z1 = z1 * -FIX_0_899976223
+        z2 = z2 * -FIX_2_562915447
+        z3 = z3 * -FIX_1_961570560 + z5
+        z4 = z4 * -FIX_0_390180644 + z5
+        data[i+7] = (tmp4 + z1 + z3 + (1 << (CB_M_P1 - 1))) >> CB_M_P1
+        data[i+5] = (tmp5 + z2 + z4 + (1 << (CB_M_P1 - 1))) >> CB_M_P1
+        data[i+3] = (tmp6 + z2 + z3 + (1 << (CB_M_P1 - 1))) >> CB_M_P1
+        data[i+1] = (tmp7 + z1 + z4 + (1 << (CB_M_P1 - 1))) >> CB_M_P1
       end
-      # Column pass: out[v*8+u] = sum_y A[v][y] * temp[y*8+u]
-      8.times do |u|
-        t0 = temp[u]; t1 = temp[8|u]; t2 = temp[16|u]; t3 = temp[24|u]
-        t4 = temp[32|u]; t5 = temp[40|u]; t6 = temp[48|u]; t7 = temp[56|u]
-        8.times do |v|
-          v8 = v << 3
-          out[v8|u] = m[v8]*t0 + m[v8|1]*t1 + m[v8|2]*t2 + m[v8|3]*t3 +
-                      m[v8|4]*t4 + m[v8|5]*t5 + m[v8|6]*t6 + m[v8|7]*t7
-        end
+      # Pass 2: process columns
+      8.times do |col|
+        d0 = data[col]; d1 = data[col+8]; d2 = data[col+16]; d3 = data[col+24]
+        d4 = data[col+32]; d5 = data[col+40]; d6 = data[col+48]; d7 = data[col+56]
+        tmp0 = d0 + d7; tmp7 = d0 - d7
+        tmp1 = d1 + d6; tmp6 = d1 - d6
+        tmp2 = d2 + d5; tmp5 = d2 - d5
+        tmp3 = d3 + d4; tmp4 = d3 - d4
+        tmp10 = tmp0 + tmp3; tmp13 = tmp0 - tmp3
+        tmp11 = tmp1 + tmp2; tmp12 = tmp1 - tmp2
+        data[col]    = (tmp10 + tmp11 + (1 << (P1_P3 - 1))) >> P1_P3
+        data[col+32] = (tmp10 - tmp11 + (1 << (P1_P3 - 1))) >> P1_P3
+        z1 = (tmp12 + tmp13) * FIX_0_541196100
+        data[col+16] = (z1 + tmp13 * FIX_0_765366865 + (1 << (CB_P_P1_P3 - 1))) >> CB_P_P1_P3
+        data[col+48] = (z1 - tmp12 * FIX_1_847759065 + (1 << (CB_P_P1_P3 - 1))) >> CB_P_P1_P3
+        z1 = tmp4 + tmp7; z2 = tmp5 + tmp6
+        z3 = tmp4 + tmp6; z4 = tmp5 + tmp7
+        z5 = (z3 + z4) * FIX_1_175875602
+        tmp4 = tmp4 * FIX_0_298631336
+        tmp5 = tmp5 * FIX_2_053119869
+        tmp6 = tmp6 * FIX_3_072711026
+        tmp7 = tmp7 * FIX_1_501321110
+        z1 = z1 * -FIX_0_899976223
+        z2 = z2 * -FIX_2_562915447
+        z3 = z3 * -FIX_1_961570560 + z5
+        z4 = z4 * -FIX_0_390180644 + z5
+        data[col+56] = (tmp4 + z1 + z3 + (1 << (CB_P_P1_P3 - 1))) >> CB_P_P1_P3
+        data[col+40] = (tmp5 + z2 + z4 + (1 << (CB_P_P1_P3 - 1))) >> CB_P_P1_P3
+        data[col+24] = (tmp6 + z2 + z3 + (1 << (CB_P_P1_P3 - 1))) >> CB_P_P1_P3
+        data[col+8]  = (tmp7 + z1 + z4 + (1 << (CB_P_P1_P3 - 1))) >> CB_P_P1_P3
       end
-      out
+      data
     end
-    # Separable inverse 2D DCT: same structure as forward but using A^T.
-    # f = A^T * F * A
-    def self.inverse!(block, temp, out)
-      mt = MATRIX_T_FLAT
-      # Row pass: temp[v*8+x] = sum_u A^T[x][u] * block[v*8+u]
-      8.times do |v|
-        v8 = v << 3
-        b0 = block[v8]; b1 = block[v8|1]; b2 = block[v8|2]; b3 = block[v8|3]
-        b4 = block[v8|4]; b5 = block[v8|5]; b6 = block[v8|6]; b7 = block[v8|7]
-        8.times do |x|
-          x8 = x << 3
-          temp[v8|x] = mt[x8]*b0 + mt[x8|1]*b1 + mt[x8|2]*b2 + mt[x8|3]*b3 +
-                        mt[x8|4]*b4 + mt[x8|5]*b5 + mt[x8|6]*b6 + mt[x8|7]*b7
-        end
+    # Inverse 2D DCT (in-place). Input: dequantized DCT coefficients (integers).
+    # Output: spatial-domain values (integers) that still need +128 level shift.
+    def self.inverse!(data, _temp = nil, _out = nil)
+      # Pass 1: process columns
+      8.times do |col|
+        d0 = data[col]; d2 = data[col+16]; d4 = data[col+32]; d6 = data[col+48]
+        d1 = data[col+8]; d3 = data[col+24]; d5 = data[col+40]; d7 = data[col+56]
+        # Even part
+        z1 = (d2 + d6) * FIX_0_541196100
+        tmp2 = z1 - d6 * FIX_1_847759065
+        tmp3 = z1 + d2 * FIX_0_765366865
+        tmp0 = (d0 + d4) << CB
+        tmp1 = (d0 - d4) << CB
+        tmp10 = tmp0 + tmp3; tmp13 = tmp0 - tmp3
+        tmp11 = tmp1 + tmp2; tmp12 = tmp1 - tmp2
+        # Odd part
+        tmp0 = d7; tmp1 = d5; tmp2 = d3; tmp3 = d1
+        z1 = tmp0 + tmp3; z2 = tmp1 + tmp2
+        z3 = tmp0 + tmp2; z4 = tmp1 + tmp3
+        z5 = (z3 + z4) * FIX_1_175875602
+        tmp0 = tmp0 * FIX_0_298631336
+        tmp1 = tmp1 * FIX_2_053119869
+        tmp2 = tmp2 * FIX_3_072711026
+        tmp3 = tmp3 * FIX_1_501321110
+        z1 = z1 * -FIX_0_899976223
+        z2 = z2 * -FIX_2_562915447
+        z3 = z3 * -FIX_1_961570560 + z5
+        z4 = z4 * -FIX_0_390180644 + z5
+        tmp0 += z1 + z3; tmp1 += z2 + z4
+        tmp2 += z2 + z3; tmp3 += z1 + z4
+        data[col]    = (tmp10 + tmp3 + (1 << (CB_M_P1 - 1))) >> CB_M_P1
+        data[col+56] = (tmp10 - tmp3 + (1 << (CB_M_P1 - 1))) >> CB_M_P1
+        data[col+8]  = (tmp11 + tmp2 + (1 << (CB_M_P1 - 1))) >> CB_M_P1
+        data[col+48] = (tmp11 - tmp2 + (1 << (CB_M_P1 - 1))) >> CB_M_P1
+        data[col+16] = (tmp12 + tmp1 + (1 << (CB_M_P1 - 1))) >> CB_M_P1
+        data[col+40] = (tmp12 - tmp1 + (1 << (CB_M_P1 - 1))) >> CB_M_P1
+        data[col+24] = (tmp13 + tmp0 + (1 << (CB_M_P1 - 1))) >> CB_M_P1
+        data[col+32] = (tmp13 - tmp0 + (1 << (CB_M_P1 - 1))) >> CB_M_P1
       end
-      # Column pass: out[y*8+x] = sum_v A^T[y][v] * temp[v*8+x]
-      8.times do |x|
-        t0 = temp[x]; t1 = temp[8|x]; t2 = temp[16|x]; t3 = temp[24|x]
-        t4 = temp[32|x]; t5 = temp[40|x]; t6 = temp[48|x]; t7 = temp[56|x]
-        8.times do |y|
-          y8 = y << 3
-          out[y8|x] = mt[y8]*t0 + mt[y8|1]*t1 + mt[y8|2]*t2 + mt[y8|3]*t3 +
-                       mt[y8|4]*t4 + mt[y8|5]*t5 + mt[y8|6]*t6 + mt[y8|7]*t7
-        end
+      # Pass 2: process rows
+      8.times do |row|
+        i = row << 3
+        d0 = data[i]; d2 = data[i+2]; d4 = data[i+4]; d6 = data[i+6]
+        d1 = data[i+1]; d3 = data[i+3]; d5 = data[i+5]; d7 = data[i+7]
+        # Even part
+        z1 = (d2 + d6) * FIX_0_541196100
+        tmp2 = z1 - d6 * FIX_1_847759065
+        tmp3 = z1 + d2 * FIX_0_765366865
+        tmp0 = (d0 + d4) << CB
+        tmp1 = (d0 - d4) << CB
+        tmp10 = tmp0 + tmp3; tmp13 = tmp0 - tmp3
+        tmp11 = tmp1 + tmp2; tmp12 = tmp1 - tmp2
+        # Odd part
+        tmp0 = d7; tmp1 = d5; tmp2 = d3; tmp3 = d1
+        z1 = tmp0 + tmp3; z2 = tmp1 + tmp2
+        z3 = tmp0 + tmp2; z4 = tmp1 + tmp3
+        z5 = (z3 + z4) * FIX_1_175875602
+        tmp0 = tmp0 * FIX_0_298631336
+        tmp1 = tmp1 * FIX_2_053119869
+        tmp2 = tmp2 * FIX_3_072711026
+        tmp3 = tmp3 * FIX_1_501321110
+        z1 = z1 * -FIX_0_899976223
+        z2 = z2 * -FIX_2_562915447
+        z3 = z3 * -FIX_1_961570560 + z5
+        z4 = z4 * -FIX_0_390180644 + z5
+        tmp0 += z1 + z3; tmp1 += z2 + z4
+        tmp2 += z2 + z3; tmp3 += z1 + z4
+        data[i]   = (tmp10 + tmp3 + (1 << (CB_P_P1_P3 - 1))) >> CB_P_P1_P3
+        data[i+7] = (tmp10 - tmp3 + (1 << (CB_P_P1_P3 - 1))) >> CB_P_P1_P3
+        data[i+1] = (tmp11 + tmp2 + (1 << (CB_P_P1_P3 - 1))) >> CB_P_P1_P3
+        data[i+6] = (tmp11 - tmp2 + (1 << (CB_P_P1_P3 - 1))) >> CB_P_P1_P3
+        data[i+2] = (tmp12 + tmp1 + (1 << (CB_P_P1_P3 - 1))) >> CB_P_P1_P3
+        data[i+5] = (tmp12 - tmp1 + (1 << (CB_P_P1_P3 - 1))) >> CB_P_P1_P3
+        data[i+3] = (tmp13 + tmp0 + (1 << (CB_P_P1_P3 - 1))) >> CB_P_P1_P3
+        data[i+4] = (tmp13 - tmp0 + (1 << (CB_P_P1_P3 - 1))) >> CB_P_P1_P3
       end
-      out
+      data
     end
   end
 end
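
The `FIX_*` constants in this file appear to be the named real values scaled by 2^13 and rounded, and the repeated `(x + (1 << (n - 1))) >> n` pattern is a rounding right shift. The sketch below re-derives the table and shows the shift on positive and negative inputs. `fix` and `descale` are hypothetical helper names for illustration; they do not exist in the gem.

```ruby
# Hypothetical helper: scale a real constant to fixed point with `bits` fractional bits.
def fix(x, bits = 13)
  (x * (1 << bits) + 0.5).floor
end

# Hypothetical helper: right shift by n with round-to-nearest,
# the same pattern the diff writes inline on every output line.
def descale(x, n)
  (x + (1 << (n - 1))) >> n
end

# Real value (taken from each constant's name) => integer value in the diff.
expected = {
  0.298631336 => 2446,  0.390180644 => 3196,  0.541196100 => 4433,
  0.765366865 => 6270,  0.899976223 => 7373,  1.175875602 => 9633,
  1.501321110 => 12299, 1.847759065 => 15137, 1.961570560 => 16069,
  2.053119869 => 16819, 2.562915447 => 20995, 3.072711026 => 25172
}

mismatches = expected.reject { |real, scaled| fix(real) == scaled }
puts "constants that do not match: #{mismatches.size}"

# 5 * 0.5411961 is about 2.706, so the rounded fixed-point product is 3.
p descale(5 * 4433, 13)    # => 3
p descale(-5 * 4433, 13)   # => -3
```

Ruby's `>>` on a negative Integer is an arithmetic shift (it floors), which is why adding the half-unit bias first gives round-to-nearest for both signs.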

data/lib/pure_jpeg/decoder.rb

@@ -78,10 +78,8 @@ module PureJPEG
       # Reusable buffers
       zigzag = Array.new(64, 0)
-      raster = Array.new(64, 0.0)
-      dequant = Array.new(64, 0.0)
-      temp = Array.new(64, 0.0)
-      spatial = Array.new(64, 0.0)
+      raster = Array.new(64, 0)
+      dequant = Array.new(64, 0)
       mcus_y.times do |mcu_row|
         mcus_x.times do |mcu_col|
@@ -104,12 +102,12 @@ module PureJPEG
                 # Inverse pipeline: unzigzag -> dequantize -> IDCT -> level shift
                 Zigzag.unreorder!(zigzag, raster)
                 Quantization.dequantize!(raster, qt, dequant)
-                DCT.inverse!(dequant, temp, spatial)
+                DCT.inverse!(dequant)
                 # Write block into channel buffer
                 bx = (mcu_col * comp.h_sampling + bh) * 8
                 by = (mcu_row * comp.v_sampling + bv) * 8
-                write_block(spatial, ch[:data], ch[:width], bx, by)
+                write_block(dequant, ch[:data], ch[:width], bx, by)
               end
             end
           end
@@ -204,10 +202,8 @@ module PureJPEG
       end
       zigzag = Array.new(64, 0)
-      raster = Array.new(64, 0.0)
-      dequant = Array.new(64, 0.0)
-      temp = Array.new(64, 0.0)
-      spatial = Array.new(64, 0.0)
+      raster = Array.new(64, 0)
+      dequant = Array.new(64, 0)
       jfif.components.each do |c|
         qt = fetch_quant_table!(jfif, c)
@@ -222,8 +218,8 @@ module PureJPEG
             Zigzag.unreorder!(zigzag, raster)
             Quantization.dequantize!(raster, qt, dequant)
-            DCT.inverse!(dequant, temp, spatial)
-            write_block(spatial, ch[:data], ch[:width], block_x * 8, block_y * 8)
+            DCT.inverse!(dequant)
+            write_block(dequant, ch[:data], ch[:width], block_x * 8, block_y * 8)
           end
         end
       end
@@ -460,12 +456,16 @@ module PureJPEG
     # Write an 8x8 spatial block (level-shifted by +128) into a channel buffer.
     def write_block(spatial, channel, ch_width, bx, by)
       8.times do |row|
-        dst_row = (by + row) * ch_width + bx
-        row8 = row << 3
-        8.times do |col|
-          val = (spatial[row8 | col] + 128.0).round
-          channel[dst_row + col] = val < 0 ? 0 : (val > 255 ? 255 : val)
-        end
+        dst = (by + row) * ch_width + bx
+        r8 = row << 3
+        v = spatial[r8]     + 128; channel[dst]     = v < 0 ? 0 : (v > 255 ? 255 : v)
+        v = spatial[r8 | 1] + 128; channel[dst + 1] = v < 0 ? 0 : (v > 255 ? 255 : v)
+        v = spatial[r8 | 2] + 128; channel[dst + 2] = v < 0 ? 0 : (v > 255 ? 255 : v)
+        v = spatial[r8 | 3] + 128; channel[dst + 3] = v < 0 ? 0 : (v > 255 ? 255 : v)
+        v = spatial[r8 | 4] + 128; channel[dst + 4] = v < 0 ? 0 : (v > 255 ? 255 : v)
+        v = spatial[r8 | 5] + 128; channel[dst + 5] = v < 0 ? 0 : (v > 255 ? 255 : v)
+        v = spatial[r8 | 6] + 128; channel[dst + 6] = v < 0 ? 0 : (v > 255 ? 255 : v)
+        v = spatial[r8 | 7] + 128; channel[dst + 7] = v < 0 ? 0 : (v > 255 ? 255 : v)
       end
     end
@@ -493,18 +493,27 @@ module PureJPEG
     def assemble_grayscale(width, height, channels, comp)
       ch = channels[comp.id]
+      ch_data = ch[:data]
+      ch_width = ch[:width]
       pixels = Array.new(width * height)
       height.times do |y|
-        src_row = y * ch[:width]
+        src_row = y * ch_width
         dst_row = y * width
         width.times do |x|
-          v = ch[:data][src_row + x]
+          v = ch_data[src_row + x]
           pixels[dst_row + x] = (v << 16) | (v << 8) | v
         end
       end
       Image.new(width, height, pixels, icc_profile: @icc_profile)
     end
+    # Fixed-point coefficients (scaled by 2^16) for YCbCr→RGB.
+    FP_R_CR =  91881  # 1.402    * 65536
+    FP_G_CB = -22554  # -0.344136 * 65536
+    FP_G_CR = -46802  # -0.714136 * 65536
+    FP_B_CB = 116130  # 1.772    * 65536
+    FP_HALF =  32768  # rounding bias
     def assemble_color(width, height, channels, components, max_h, max_v)
       # Upsample chroma channels if needed and convert YCbCr to RGB
       y_comp, cb_comp, cr_comp = resolve_color_components(components)
@@ -513,29 +522,39 @@ module PureJPEG
       cb_ch = channels[cb_comp.id]
       cr_ch = channels[cr_comp.id]
+      y_data = y_ch[:data]
+      cb_data = cb_ch[:data]
+      cr_data = cr_ch[:data]
+      y_stride = y_ch[:width]
+      cb_stride = cb_ch[:width]
+      cr_stride = cr_ch[:width]
+      cb_h = cb_comp.h_sampling
+      cb_v = cb_comp.v_sampling
+      cr_h = cr_comp.h_sampling
+      cr_v = cr_comp.v_sampling
       pixels = Array.new(width * height)
       height.times do |py|
         dst_row = py * width
-        y_row = py * y_ch[:width]
+        y_row = py * y_stride
         # Chroma coordinates (nearest-neighbor upsampling)
-        cb_y = (py * cb_comp.v_sampling) / max_v
-        cr_y = (py * cr_comp.v_sampling) / max_v
-        cb_row = cb_y * cb_ch[:width]
-        cr_row = cr_y * cr_ch[:width]
+        cb_row = ((py * cb_v) / max_v) * cb_stride
+        cr_row = ((py * cr_v) / max_v) * cr_stride
         width.times do |px|
-          lum = y_ch[:data][y_row + px]
+          lum = y_data[y_row + px]
-          cb_x = (px * cb_comp.h_sampling) / max_h
-          cr_x = (px * cr_comp.h_sampling) / max_h
-          cb = cb_ch[:data][cb_row + cb_x] - 128.0
-          cr = cr_ch[:data][cr_row + cr_x] - 128.0
+          cb_x = (px * cb_h) / max_h
+          cr_x = (px * cr_h) / max_h
+          cb_val = cb_data[cb_row + cb_x] - 128
+          cr_val = cr_data[cr_row + cr_x] - 128
-          r = (lum + 1.402 * cr).round
-          g = (lum - 0.344136 * cb - 0.714136 * cr).round
-          b = (lum + 1.772 * cb).round
+          # Fixed-point YCbCr→RGB (all integer arithmetic)
+          r = lum + ((FP_R_CR * cr_val + FP_HALF) >> 16)
+          g = lum + ((FP_G_CB * cb_val + FP_G_CR * cr_val + FP_HALF) >> 16)
+          b = lum + ((FP_B_CB * cb_val + FP_HALF) >> 16)
           r = r < 0 ? 0 : (r > 255 ? 255 : r)
           g = g < 0 ? 0 : (g > 255 ? 255 : g)
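
To get a feel for how close the fixed-point YCbCr→RGB path stays to the float version it replaces, the sketch below runs both over every chroma pair at a fixed luma and records the largest per-channel difference. The constants are copied from the hunk above; the method names are hypothetical, and this is a comparison harness rather than the decoder's actual code path.

```ruby
# Constants copied from the diff (coefficients scaled by 2^16).
FP_R_CR =  91881
FP_G_CB = -22554
FP_G_CR = -46802
FP_B_CB = 116130
FP_HALF =  32768

def clamp255(v)
  v < 0 ? 0 : (v > 255 ? 255 : v)
end

# Hypothetical name: the integer-only conversion from the new code.
def ycbcr_to_rgb_fixed(lum, cb, cr)
  cb -= 128
  cr -= 128
  [
    clamp255(lum + ((FP_R_CR * cr + FP_HALF) >> 16)),
    clamp255(lum + ((FP_G_CB * cb + FP_G_CR * cr + FP_HALF) >> 16)),
    clamp255(lum + ((FP_B_CB * cb + FP_HALF) >> 16))
  ]
end

# Hypothetical name: the float conversion from the removed lines.
def ycbcr_to_rgb_float(lum, cb, cr)
  cb -= 128.0
  cr -= 128.0
  [
    clamp255((lum + 1.402 * cr).round),
    clamp255((lum - 0.344136 * cb - 0.714136 * cr).round),
    clamp255((lum + 1.772 * cb).round)
  ]
end

max_diff = 0
(0..255).each do |cb|
  (0..255).each do |cr|
    fixed = ycbcr_to_rgb_fixed(128, cb, cr)
    float = ycbcr_to_rgb_float(128, cb, cr)
    fixed.zip(float) { |a, b| max_diff = [max_diff, (a - b).abs].max }
  end
end
puts "largest per-channel difference: #{max_diff}"
```

Because the scaled coefficients are within a few millionths of the real ones, the two paths can only disagree when a result sits near a rounding boundary, and then by at most one level.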

data/lib/pure_jpeg/encoder.rb

@@ -205,17 +205,14 @@ module PureJPEG
       padded_w = (width + 7) & ~7
       padded_h = (height + 7) & ~7
-      block = Array.new(64, 0.0)
-      temp  = Array.new(64, 0.0)
-      dct   = Array.new(64, 0.0)
+      block = Array.new(64, 0)
       qbuf  = Array.new(64, 0)
       zbuf  = Array.new(64, 0)
       (0...padded_h).step(8) do |by|
         (0...padded_w).step(8) do |bx|
           extract_block_into(y_data, width, height, bx, by, block)
-          transform_block(block, temp, dct, qbuf, zbuf, qtable)
-          yield zbuf
+          yield transform_block(block, qbuf, zbuf, qtable)
         end
       end
     end
@@ -278,37 +275,29 @@ module PureJPEG
       mcu_w = (width + 15) & ~15
       mcu_h = (height + 15) & ~15
-      block = Array.new(64, 0.0)
-      temp  = Array.new(64, 0.0)
-      dct   = Array.new(64, 0.0)
+      block = Array.new(64, 0)
       qbuf  = Array.new(64, 0)
       zbuf  = Array.new(64, 0)
       (0...mcu_h).step(16) do |my|
         (0...mcu_w).step(16) do |mx|
           extract_block_into(y_data, width, height, mx, my, block)
-          transform_block(block, temp, dct, qbuf, zbuf, lum_qt)
-          yield :y, zbuf
+          yield :y, transform_block(block, qbuf, zbuf, lum_qt)
           extract_block_into(y_data, width, height, mx + 8, my, block)
-          transform_block(block, temp, dct, qbuf, zbuf, lum_qt)
-          yield :y, zbuf
+          yield :y, transform_block(block, qbuf, zbuf, lum_qt)
           extract_block_into(y_data, width, height, mx, my + 8, block)
-          transform_block(block, temp, dct, qbuf, zbuf, lum_qt)
-          yield :y, zbuf
+          yield :y, transform_block(block, qbuf, zbuf, lum_qt)
           extract_block_into(y_data, width, height, mx + 8, my + 8, block)
-          transform_block(block, temp, dct, qbuf, zbuf, lum_qt)
-          yield :y, zbuf
+          yield :y, transform_block(block, qbuf, zbuf, lum_qt)
           extract_block_into(cb_sub, sub_w, sub_h, mx >> 1, my >> 1, block)
-          transform_block(block, temp, dct, qbuf, zbuf, chr_qt)
-          yield :cb, zbuf
+          yield :cb, transform_block(block, qbuf, zbuf, chr_qt)
           extract_block_into(cr_sub, sub_w, sub_h, mx >> 1, my >> 1, block)
-          transform_block(block, temp, dct, qbuf, zbuf, chr_qt)
-          yield :cr, zbuf
+          yield :cr, transform_block(block, qbuf, zbuf, chr_qt)
         end
       end
     end
@@ -333,9 +322,9 @@ module PureJPEG
     # --- Shared block pipeline (all buffers pre-allocated) ---
-    def transform_block(block, temp, dct, qbuf, zbuf, qtable)
-      DCT.forward!(block, temp, dct)
-      Quantization.quantize!(dct, qtable, qbuf)
+    def transform_block(block, qbuf, zbuf, qtable)
+      DCT.forward!(block)
+      Quantization.quantize!(block, qtable, qbuf)
       Zigzag.reorder!(qbuf, zbuf)
       zbuf
     end
@@ -352,26 +341,42 @@ module PureJPEG
       end
     end
+    # Fixed-point coefficients (scaled by 2^16 = 65536) for RGB→YCbCr.
+    # Y  =  0.299*R + 0.587*G + 0.114*B
+    # Cb = -0.168736*R - 0.331264*G + 0.5*B + 128
+    # Cr =  0.5*R - 0.418688*G - 0.081312*B + 128
+    FP_Y_R  =  19595; FP_Y_G  =  38470; FP_Y_B  =   7471
+    FP_CB_R = -11058; FP_CB_G = -21710; FP_CB_B =  32768
+    FP_CR_R =  32768; FP_CR_G = -27440; FP_CR_B =  -5328
+    FP_HALF =  32768  # rounding bias
+    FP_128  = 8388608 # 128 << 16
+    def clamp255(v)
+      v < 0 ? 0 : (v > 255 ? 255 : v)
+    end
     def extract_luminance(width, height)
       luminance = Array.new(width * height)
       if source.respond_to?(:packed_pixels)
         packed = source.packed_pixels
         r_shift, g_shift, b_shift = packed_shifts
+        n = width * height
         i = 0
-        (width * height).times do
+        n.times do
           color = packed[i]
           r = (color >> r_shift) & 0xFF
           g = (color >> g_shift) & 0xFF
           b = (color >> b_shift) & 0xFF
-          luminance[i] = (0.299 * r + 0.587 * g + 0.114 * b).round.clamp(0, 255)
+          luminance[i] = clamp255((FP_Y_R * r + FP_Y_G * g + FP_Y_B * b + FP_HALF) >> 16)
           i += 1
         end
       else
-        height.times do |y|
-          row = y * width
-          width.times do |x|
-            pixel = source[x, y]
-            luminance[row + x] = (0.299 * pixel.r + 0.587 * pixel.g + 0.114 * pixel.b).round.clamp(0, 255)
+        height.times do |py|
+          row = py * width
+          width.times do |px|
+            pixel = source[px, py]
+            r = pixel.r; g = pixel.g; b = pixel.b
+            luminance[row + px] = clamp255((FP_Y_R * r + FP_Y_G * g + FP_Y_B * b + FP_HALF) >> 16)
           end
         end
       end
@@ -393,9 +398,9 @@ module PureJPEG
           r = (color >> r_shift) & 0xFF
           g = (color >> g_shift) & 0xFF
           b = (color >> b_shift) & 0xFF
-          y_data[i]  = ( 0.299    * r + 0.587    * g + 0.114    * b).round.clamp(0, 255)
-          cb_data[i] = (-0.168736 * r - 0.331264 * g + 0.5      * b + 128.0).round.clamp(0, 255)
-          cr_data[i] = ( 0.5      * r - 0.418688 * g - 0.081312 * b + 128.0).round.clamp(0, 255)
+          y_data[i]  = clamp255((FP_Y_R * r + FP_Y_G * g + FP_Y_B * b + FP_HALF) >> 16)
+          cb_data[i] = clamp255((FP_CB_R * r + FP_CB_G * g + FP_CB_B * b + FP_128 + FP_HALF) >> 16)
+          cr_data[i] = clamp255((FP_CR_R * r + FP_CR_G * g + FP_CR_B * b + FP_128 + FP_HALF) >> 16)
           i += 1
         end
       else
@@ -405,9 +410,9 @@ module PureJPEG
             pixel = source[px, py]
             r = pixel.r; g = pixel.g; b = pixel.b
             i = row + px
-            y_data[i]  = ( 0.299    * r + 0.587    * g + 0.114    * b).round.clamp(0, 255)
-            cb_data[i] = (-0.168736 * r - 0.331264 * g + 0.5      * b + 128.0).round.clamp(0, 255)
-            cr_data[i] = ( 0.5      * r - 0.418688 * g - 0.081312 * b + 128.0).round.clamp(0, 255)
+            y_data[i]  = clamp255((FP_Y_R * r + FP_Y_G * g + FP_Y_B * b + FP_HALF) >> 16)
+            cb_data[i] = clamp255((FP_CB_R * r + FP_CB_G * g + FP_CB_B * b + FP_128 + FP_HALF) >> 16)
+            cr_data[i] = clamp255((FP_CR_R * r + FP_CR_G * g + FP_CR_B * b + FP_128 + FP_HALF) >> 16)
           end
         end
       end
@@ -442,13 +447,16 @@ module PureJPEG
       8.times do |row|
         sy = by + row
         sy = max_y if sy > max_y
-        src_row = sy * width
-        row8 = row << 3
-        8.times do |col|
-          sx = bx + col
-          sx = max_x if sx > max_x
-          block[row8 | col] = channel[src_row + sx] - 128.0
-        end
+        src = sy * width
+        r8 = row << 3
+        x = bx;     block[r8]     = channel[src + (x > max_x ? max_x : x)] - 128
+        x = bx + 1; block[r8 | 1] = channel[src + (x > max_x ? max_x : x)] - 128
+        x = bx + 2; block[r8 | 2] = channel[src + (x > max_x ? max_x : x)] - 128
+        x = bx + 3; block[r8 | 3] = channel[src + (x > max_x ? max_x : x)] - 128
+        x = bx + 4; block[r8 | 4] = channel[src + (x > max_x ? max_x : x)] - 128
+        x = bx + 5; block[r8 | 5] = channel[src + (x > max_x ? max_x : x)] - 128
+        x = bx + 6; block[r8 | 6] = channel[src + (x > max_x ? max_x : x)] - 128
+        x = bx + 7; block[r8 | 7] = channel[src + (x > max_x ? max_x : x)] - 128
       end
       block
     end
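
A quick sanity check on the RGB→YCbCr constants: the Y coefficients sum to exactly 65536 and the Cb and Cr coefficients each sum to zero, so any grey input should map to `Y = v, Cb = Cr = 128`. The sketch below copies the constants from the hunk above and wraps the per-pixel expression in a hypothetical `rgb_to_ycbcr_fixed` method for illustration.

```ruby
# Constants copied from the diff (coefficients scaled by 2^16).
FP_Y_R  =  19595; FP_Y_G  =  38470; FP_Y_B  =   7471
FP_CB_R = -11058; FP_CB_G = -21710; FP_CB_B =  32768
FP_CR_R =  32768; FP_CR_G = -27440; FP_CR_B =  -5328
FP_HALF =  32768   # rounding bias (0.5 in 16.16 fixed point)
FP_128  = 8388608  # 128 << 16, the chroma offset in fixed point

def clamp255(v)
  v < 0 ? 0 : (v > 255 ? 255 : v)
end

# Hypothetical wrapper around the per-pixel expressions in the diff.
def rgb_to_ycbcr_fixed(r, g, b)
  [
    clamp255((FP_Y_R * r + FP_Y_G * g + FP_Y_B * b + FP_HALF) >> 16),
    clamp255((FP_CB_R * r + FP_CB_G * g + FP_CB_B * b + FP_128 + FP_HALF) >> 16),
    clamp255((FP_CR_R * r + FP_CR_G * g + FP_CR_B * b + FP_128 + FP_HALF) >> 16)
  ]
end

p FP_Y_R + FP_Y_G + FP_Y_B         # => 65536
p FP_CB_R + FP_CB_G + FP_CB_B      # => 0
p FP_CR_R + FP_CR_G + FP_CR_B      # => 0

p rgb_to_ycbcr_fixed(200, 200, 200) # => [200, 128, 128]
p rgb_to_ycbcr_fixed(255, 0, 0)     # => [76, 85, 255]
```

The pure-red case also shows why the clamp is still needed: the Cr expression evaluates to 256 before clamping, just as the float version rounds 255.5 up to 256.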

data/lib/pure_jpeg/huffman/encoder.rb

@@ -3,17 +3,22 @@
 module PureJPEG
   module Huffman
     class Encoder
-      def self.category_and_bits(value)
-        return [0, 0] if value == 0
-        abs_val = value.abs
+      # Return the Huffman category (bit length) for a value.
+      # Avoids Array allocation compared to the combined category_and_bits.
+      def self.category(value)
+        return 0 if value == 0
+        v = value.abs
         cat = 0
-        v = abs_val
         while v > 0
           cat += 1
           v >>= 1
         end
-        bits = value > 0 ? value : value + (1 << cat) - 1
-        [cat, bits]
+        cat
+      end
+      # Return the extra bits to encode for a value with the given category.
+      def self.value_bits(value, cat)
+        value > 0 ? value : value + (1 << cat) - 1
       end
       def self.each_ac_item(zigzag)
@@ -39,7 +44,7 @@ module PureJPEG
           end
           value = zigzag[i]
-          cat, = category_and_bits(value)
+          cat = category(value)
           yield (run << 4) | cat, value
           i += 1
         end
@@ -73,10 +78,10 @@ module PureJPEG
       private
       def encode_dc(diff, writer)
-        cat, bits = self.class.category_and_bits(diff)
+        cat = self.class.category(diff)
         code, length = @dc_table[cat]
         writer.write_bits(code, length)
-        writer.write_bits(bits, cat) if cat > 0
+        writer.write_bits(self.class.value_bits(diff, cat), cat) if cat > 0
       end
       def encode_ac(zigzag, writer)
@@ -85,8 +90,8 @@ module PureJPEG
           writer.write_bits(code, length)
           next if symbol == 0x00 || symbol == 0xF0
-          cat, bits = self.class.category_and_bits(value)
-          writer.write_bits(bits, cat)
+          cat = self.class.category(value)
+          writer.write_bits(self.class.value_bits(value, cat), cat)
         end
       end
     end
@@ -104,7 +109,7 @@ module PureJPEG
         diff = zigzag[0] - @prev_dc[state_key]
         @prev_dc[state_key] = zigzag[0]
-        cat, = Encoder.category_and_bits(diff)
+        cat = Encoder.category(diff)
         @dc_frequencies[cat] += 1
         Encoder.each_ac_symbol(zigzag) do |symbol|
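
For readers unfamiliar with JPEG's category/extra-bits scheme, the sketch below lifts the two new method bodies out of the class as top-level methods (an assumption made purely so the example runs on its own) and prints a few values. As a cross-check it also compares the loop against Ruby's built-in `Integer#bit_length`; that comparison is an observation about the arithmetic, not something the gem does.

```ruby
# Bodies copied from the diff, defined at top level for a standalone demo.
def category(value)
  return 0 if value == 0
  v = value.abs
  cat = 0
  while v > 0
    cat += 1
    v >>= 1
  end
  cat
end

# Negative values are stored as value + (2^cat - 1), i.e. with the top bit cleared.
def value_bits(value, cat)
  value > 0 ? value : value + (1 << cat) - 1
end

[5, -5, 1, -1, 0].each do |v|
  cat = category(v)
  bits = cat.zero? ? "(none)" : value_bits(v, cat).to_s(2).rjust(cat, "0")
  puts format("%3d -> category %d, bits %s", v, cat, bits)
end

# Cross-check: the category is the bit length of the magnitude.
agrees = (-1023..1023).all? { |v| category(v) == v.abs.bit_length }
puts "matches Integer#bit_length: #{agrees}"
```

Splitting the old `category_and_bits` into two methods means callers that only need the category, such as the frequency counter, no longer build and discard a two-element Array per coefficient.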

data/lib/pure_jpeg/quantization.rb

@@ -36,9 +36,20 @@ module PureJPEG
       }
     end
-    # Quantize a 64-element DCT block in place into `out`.
+    # Quantize a 64-element DCT block into `out`.
+    # Uses integer rounding division (round-to-nearest) to match the
+    # behavior of Float division + round from the previous float DCT.
     def self.quantize!(block, table, out)
-      64.times { |i| out[i] = (block[i] / table[i]).round }
+      i = 0
+      while i < 64
+        v = block[i]; t = table[i]
+        out[i] = if v >= 0
+                   (v + (t >> 1)) / t
+                 else
+                   -((-v + (t >> 1)) / t)
+                 end
+        i += 1
+      end
       out
     end
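
The sign branch in the new `quantize!` exists because Ruby's Integer division floors toward negative infinity. The sketch below isolates the per-element arithmetic in hypothetical helpers (`quantize_int`, `quantize_naive`, `quantize_float`) and checks the integer form against the old float-then-round form over a range of coefficients and table entries. The ranges are illustrative assumptions.

```ruby
# Hypothetical helper: the per-element expression from the new quantize!.
def quantize_int(v, t)
  if v >= 0
    (v + (t >> 1)) / t
  else
    -((-v + (t >> 1)) / t)
  end
end

# Hypothetical helper: what you get without the sign branch.
def quantize_naive(v, t)
  (v + (t >> 1)) / t
end

# Hypothetical helper: the behaviour of the removed line on a float coefficient.
def quantize_float(v, t)
  (v.to_f / t).round
end

# Integer division floors, so negative ties go the wrong way without mirroring.
p(-7 / 2)                 # => -4
p quantize_naive(-7, 2)   # => -3
p quantize_int(-7, 2)     # => -4
p quantize_float(-7, 2)   # => -4

mismatches = []
(-1024..1024).each do |v|
  (1..255).each do |t|
    mismatches << [v, t] unless quantize_int(v, t) == quantize_float(v, t)
  end
end
puts "mismatches against float rounding: #{mismatches.size}"
```

Mirroring negative inputs makes ties round away from zero, which is the same tie rule `Float#round` uses by default.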

data/lib/pure_jpeg/version.rb

@@ -1,5 +1,5 @@
 # frozen_string_literal: true
 module PureJPEG
-  VERSION = "0.3.1"
+  VERSION = "0.3.2"
 end

metadata

@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: pure_jpeg
 version: !ruby/object:Gem::Version
-  version: 0.3.1
+  version: 0.3.2
 platform: ruby
 authors:
 - Peter Cooper
@@ -86,7 +86,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
     - !ruby/object:Gem::Version
       version: '0'
 requirements: []
-rubygems_version: 4.0.3
+rubygems_version: 3.6.9
 specification_version: 4
 summary: Pure Ruby JPEG encoder and decoder
 test_files: []