RubyGems - philiprehberger-encoding_kit - Versions diffs - 0.3.0 → 0.5.0 - Mend

philiprehberger-encoding_kit 0.3.0 → 0.5.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (7)

checksums.yaml +4 -4
data/CHANGELOG.md +13 -0
data/README.md +28 -0
data/lib/philiprehberger/encoding_kit/converter.rb +14 -0
data/lib/philiprehberger/encoding_kit/version.rb +1 -1
data/lib/philiprehberger/encoding_kit.rb +81 -0
metadata +2 -2

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: a328360d24956ba2fbcbc7888da4ddfb6dbe1bfac53aa65c6d7431ea6f0c1a16
-  data.tar.gz: f82f1a192f75e1ed0bf7ac386bb17cc4f259b4847af1a83cd5356ed63a2b675c
+  metadata.gz: 3e69668037179f14b92560b58c816a29a14134560bd32568775fd321325b4644
+  data.tar.gz: 3ad9997b697ca6cecca1d8e63955512eb860a86c48007a5138777167da9c4d13
 SHA512:
-  metadata.gz: 91b0e02d2d301db41bdc4eda1d3298db4fb517e0ec2bce4c4f79deae7787947db55eb0fdbe12557e9cf317b6b1d000c42ffb02b6901774afa531185b7ae46f1f
-  data.tar.gz: bbab15360e72374f5c9a540c3c8b0c54ae4b673f3bee24ff31ea47ebf2daa29c982564b69069a4f2ead487d5afbefdf606a04fc701c1d6bffd3c75038fd17026
+  metadata.gz: af626ca49ad283a08574162ed45b81fabdccf7aca6d978d78d9367df150234e28d7cd563a1c212145c3a399a7b431b65cd1508fbba67a784517344b124a85a38
+  data.tar.gz: ffd8620298177f0a689411d49bdadbcc61c0a93c0fb0c5afc6cd9e5f72ef77f627ec14aea5819ca898f4aeb774cbdef02957c1777d640712cab360c7eb6622e6

data/CHANGELOG.md CHANGED

@@ -7,6 +7,19 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 ## [Unreleased]
+## [0.5.0] - 2026-04-30
+### Added
+- `EncodingKit.scrub(string)` — strip invalid bytes from a string (vs. `normalize` which replaces with `�`)
+- `EncodingKit.normalize_line_endings(string, to:)` — convert mixed CRLF/CR/LF to a single canonical form (`:lf`, `:crlf`, or `:cr`)
+- `Converter.scrub` companion method on the `Converter` module
+## [0.4.0] - 2026-04-20
+### Added
+- `guess_from_filename(path)` — extract an encoding hint from filename extensions (`.utf8`, `.utf-16`, `.latin1`, `.cp1252`, `.sjis`, etc.). Returns `nil` when no hint is present so callers can fall back to byte-based detection
+- `FILENAME_ENCODING_HINTS` constant exposing the suffix → `Encoding` lookup table
 ## [0.3.0] - 2026-04-11
 ### Added

data/README.md CHANGED

@@ -150,6 +150,31 @@ content = Philiprehberger::EncodingKit.read_as_utf8("latin1.txt", from: Encoding
 # Check if a file's encoding is valid
 Philiprehberger::EncodingKit.file_valid?("data.csv", encoding: Encoding::UTF_8)  # => true
+# Guess encoding from a filename hint without reading the bytes
+Philiprehberger::EncodingKit.guess_from_filename("data.utf8.csv")    # => Encoding::UTF_8
+Philiprehberger::EncodingKit.guess_from_filename("legacy.latin1.txt") # => Encoding::ISO_8859_1
+Philiprehberger::EncodingKit.guess_from_filename("report.csv")        # => nil
+```
+### Stripping Invalid Bytes
+```ruby
+# normalize replaces invalid bytes with U+FFFD ('�')
+Philiprehberger::EncodingKit.normalize("foo\xFFbar")  # => "foo�bar"
+# scrub removes them entirely
+Philiprehberger::EncodingKit.scrub("foo\xFFbar")      # => "foobar"
+```
+### Normalizing Line Endings
+```ruby
+mixed = "alpha\r\nbeta\rgamma\ndelta"
+Philiprehberger::EncodingKit.normalize_line_endings(mixed)              # => "alpha\nbeta\ngamma\ndelta"
+Philiprehberger::EncodingKit.normalize_line_endings(mixed, to: :crlf)   # => "alpha\r\nbeta\r\ngamma\r\ndelta"
+Philiprehberger::EncodingKit.normalize_line_endings(mixed, to: :cr)     # => "alpha\rbeta\rgamma\rdelta"
 ```
 ### Validity Check
@@ -172,6 +197,8 @@ Philiprehberger::EncodingKit.valid?("hello", encoding: Encoding::US_ASCII)  # =>
 | `EncodingKit.transcode(string, to:, fallback:, replace:)` | Auto-detect source and convert to target encoding |
 | `EncodingKit.to_utf8(string, from: nil)` | Convert to UTF-8, auto-detect source if `from` is nil |
 | `EncodingKit.normalize(string)` | Force to valid UTF-8, replacing bad bytes with U+FFFD |
+| `EncodingKit.scrub(string)` | Force to valid UTF-8 by removing invalid bytes entirely |
+| `EncodingKit.normalize_line_endings(string, to: :lf)` | Convert mixed CRLF/CR/LF to a single canonical form (`:lf`, `:crlf`, `:cr`) |
 | `EncodingKit.valid?(string, encoding: nil)` | Check if string is valid in given or current encoding |
 | `EncodingKit.convert(string, from:, to:)` | Convert between arbitrary encodings |
 | `EncodingKit.strip_bom(string)` | Remove byte order mark if present |
@@ -179,6 +206,7 @@ Philiprehberger::EncodingKit.valid?("hello", encoding: Encoding::US_ASCII)  # =>
 | `EncodingKit.detect_file(path, sample_size: 4096)` | Detect encoding of a file by reading a byte sample |
 | `EncodingKit.read_as_utf8(path, from: nil)` | Read a file and return its content as UTF-8 |
 | `EncodingKit.file_valid?(path, encoding: nil)` | Check if a file's content is valid in the given encoding |
+| `EncodingKit.guess_from_filename(path)` | Guess `Encoding` from filename suffixes (e.g. `.utf8`, `.latin1`), `nil` if unknown |
 ## Development
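
The line-ending normalization documented in the README hunk above can be tried without installing the gem. The following is a minimal sketch, assuming only Ruby's standard `String#gsub`; the helper name `normalize_eol` and its use of `ArgumentError` are hypothetical stand-ins (the gem itself raises its own `Error` class), while the lookup table and regular expression mirror the ones shown in this diff.

```ruby
# Hypothetical stand-in for EncodingKit.normalize_line_endings.
def normalize_eol(string, to: :lf)
  endings = { lf: "\n", crlf: "\r\n", cr: "\r" }
  target = endings.fetch(to) do
    raise ArgumentError, "Unknown line ending: #{to.inspect} (expected :lf, :crlf, or :cr)"
  end
  # "\r\n" is listed first in the alternation so that a CRLF pair is
  # replaced once, rather than as a separate "\r" and "\n".
  string.gsub(/\r\n|\r|\n/, target)
end

mixed = "alpha\r\nbeta\rgamma\ndelta"
lf_text   = normalize_eol(mixed)             # every break becomes "\n"
crlf_text = normalize_eol(mixed, to: :crlf)  # every break becomes "\r\n"
```

Because the pattern matches all three forms in a single pass, running the helper on already-normalized text returns an equal string, so it should be safe to apply repeatedly.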

data/lib/philiprehberger/encoding_kit/converter.rb CHANGED

@@ -53,6 +53,20 @@ module Philiprehberger
           str.encode(Encoding::UTF_8, str.encoding, invalid: :replace, undef: :replace, replace: "\uFFFD")
         end
+        # Strip invalid bytes from a string, returning valid UTF-8 with bad bytes removed.
+        #
+        # Unlike {.normalize}, which replaces invalid bytes with `\uFFFD`, this method
+        # removes them entirely, which is useful when downstream consumers cannot tolerate
+        # any non-source content.
+        #
+        # @param string [String] the input string
+        # @return [String] valid UTF-8 string with invalid bytes removed
+        def scrub(string)
+          str = string.dup
+          str.force_encoding(Encoding::UTF_8) if [Encoding::BINARY, Encoding::ASCII_8BIT].include?(str.encoding)
+          str.scrub('')
+        end
       end
     end
   end
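
The difference between replacing and removing invalid bytes can be reproduced with Ruby's built-in `String#scrub`. This is a minimal sketch, not the gem's code: `scrub_sketch` is a hypothetical helper that follows the same three steps as the added `Converter.scrub` (duplicate, re-tag binary input as UTF-8, scrub with an empty replacement).

```ruby
# Hypothetical helper mirroring the steps of Converter.scrub in the diff.
def scrub_sketch(string)
  str = string.dup
  # Only binary-tagged input is reinterpreted as UTF-8.
  str.force_encoding(Encoding::UTF_8) if str.encoding == Encoding::BINARY
  str.scrub('')
end

raw = "foo\xFFbar".b                      # binary string with one stray byte
as_utf8 = raw.dup.force_encoding(Encoding::UTF_8)

replaced = as_utf8.scrub("\uFFFD")        # stray byte becomes U+FFFD
removed  = scrub_sketch(raw)              # stray byte is dropped: "foobar"

# A string already tagged with a single-byte encoding has no invalid bytes
# in that encoding, so nothing is removed and the encoding tag is kept.
latin = "caf\xE9".b.force_encoding(Encoding::ISO_8859_1)
untouched = scrub_sketch(latin)
```

One consequence worth noting, judging from the code as shown in the diff: input that is not binary-tagged is scrubbed in its own encoding, so callers who need UTF-8 output for such strings would presumably convert first.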

data/lib/philiprehberger/encoding_kit/version.rb CHANGED

@@ -2,6 +2,6 @@
 module Philiprehberger
   module EncodingKit
-    VERSION = '0.3.0'
+    VERSION = '0.5.0'
   end
 end

data/lib/philiprehberger/encoding_kit.rb CHANGED

@@ -104,6 +104,31 @@ module Philiprehberger
       Converter.normalize(string)
     end
+    # Strip invalid bytes from a string, returning valid UTF-8.
+    #
+    # Unlike {.normalize}, which replaces invalid bytes with `�`, this method
+    # removes them entirely.
+    #
+    # @param string [String] the input string
+    # @return [String] valid UTF-8 string with invalid bytes removed
+    def self.scrub(string)
+      Converter.scrub(string)
+    end
+    LINE_ENDINGS = { lf: "\n", crlf: "\r\n", cr: "\r" }.freeze
+    # Normalize line endings to a single canonical form.
+    #
+    # @param string [String] the input string
+    # @param to [Symbol] target line ending: `:lf`, `:crlf`, or `:cr`
+    # @return [String] string with normalized line endings
+    # @raise [Error] if `to:` is not one of `:lf`, `:crlf`, or `:cr`
+    def self.normalize_line_endings(string, to: :lf)
+      target = LINE_ENDINGS[to] or raise Error, "Unknown line ending: #{to.inspect} (expected :lf, :crlf, or :cr)"
+      string.gsub(/\r\n|\r|\n/, target)
+    end
     # Check if a string is valid in the given encoding (or its current encoding).
     #
     # @param string [String] the input string
@@ -201,6 +226,62 @@ module Philiprehberger
       valid?(raw, encoding: encoding)
     end
+    # Filename suffix / extension hints that imply a specific encoding.
+    # Matched against the last three dot-separated tokens of the filename.
+    FILENAME_ENCODING_HINTS = {
+      'utf8' => Encoding::UTF_8,
+      'utf-8' => Encoding::UTF_8,
+      'utf16' => Encoding::UTF_16,
+      'utf-16' => Encoding::UTF_16,
+      'utf16le' => Encoding::UTF_16LE,
+      'utf-16le' => Encoding::UTF_16LE,
+      'utf16be' => Encoding::UTF_16BE,
+      'utf-16be' => Encoding::UTF_16BE,
+      'utf32' => Encoding::UTF_32,
+      'utf-32' => Encoding::UTF_32,
+      'ascii' => Encoding::US_ASCII,
+      'us-ascii' => Encoding::US_ASCII,
+      'latin1' => Encoding::ISO_8859_1,
+      'latin-1' => Encoding::ISO_8859_1,
+      'iso88591' => Encoding::ISO_8859_1,
+      'iso-8859-1' => Encoding::ISO_8859_1,
+      'iso88592' => Encoding::ISO_8859_2,
+      'iso-8859-2' => Encoding::ISO_8859_2,
+      'cp1252' => Encoding::Windows_1252,
+      'windows1252' => Encoding::Windows_1252,
+      'windows-1252' => Encoding::Windows_1252,
+      'sjis' => Encoding::Shift_JIS,
+      'shiftjis' => Encoding::Shift_JIS,
+      'shift-jis' => Encoding::Shift_JIS,
+      'shift_jis' => Encoding::Shift_JIS,
+      'euc-jp' => Encoding::EUC_JP,
+      'eucjp' => Encoding::EUC_JP,
+      'gbk' => Encoding::GBK,
+      'gb2312' => Encoding::GB2312,
+      'big5' => Encoding::Big5
+    }.freeze
+    # Guess the encoding based on filename suffixes/extensions alone.
+    # Useful when a file name carries an explicit encoding hint
+    # (e.g., "data.utf8.csv", "legacy.latin1.txt"). Falls back to nil
+    # when no hint can be extracted — callers should then use
+    # {.detect_file} to inspect the bytes.
+    #
+    # Matching is case-insensitive and considers the last three
+    # dot-separated tokens of the basename; the rightmost recognizable hint wins.
+    #
+    # @param filename [String] filename or path
+    # @return [Encoding, nil] detected encoding or nil when no hint matches
+    def self.guess_from_filename(filename)
+      name = File.basename(filename.to_s).downcase
+      tokens = name.split('.').last(3) # extension + up to two modifiers
+      tokens.reverse_each do |token|
+        enc = FILENAME_ENCODING_HINTS[token]
+        return enc if enc
+      end
+      nil
+    end
     # Build a list of encoding candidates with confidence scores.
     #
     # @param bytes [String] binary string
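
The token-matching behaviour of `guess_from_filename` can be explored in isolation. The sketch below is an approximation under stated assumptions: `guess_sketch` is a hypothetical name, and its three-entry hint table is a trimmed stand-in for `FILENAME_ENCODING_HINTS`; the lookup logic follows the method body shown in this diff.

```ruby
# Hypothetical stand-in for EncodingKit.guess_from_filename, using a
# shortened hint table for illustration.
def guess_sketch(filename)
  hints = {
    'utf8'   => Encoding::UTF_8,
    'latin1' => Encoding::ISO_8859_1,
    'sjis'   => Encoding::Shift_JIS
  }
  # Lowercase the basename and keep its last three dot-separated tokens.
  tokens = File.basename(filename.to_s).downcase.split('.').last(3)
  # Scan right to left so the rightmost recognizable hint wins.
  tokens.reverse_each do |token|
    enc = hints[token]
    return enc if enc
  end
  nil
end

guess_sketch("data.utf8.csv")            # => Encoding::UTF_8
guess_sketch("/tmp/LEGACY.Latin1.TXT")   # => Encoding::ISO_8859_1
guess_sketch("report.csv")               # => nil
guess_sketch("a.latin1.utf8.txt")        # => Encoding::UTF_8
guess_sketch("sjis.txt")                 # => Encoding::Shift_JIS
```

The last call shows that, with this logic, a short filename's stem is itself one of the inspected tokens, so a file literally named `sjis.txt` yields a hint. Callers who treat the result as authoritative may want to cross-check it against byte-based detection.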

metadata CHANGED

@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: philiprehberger-encoding_kit
 version: !ruby/object:Gem::Version
-  version: 0.3.0
+  version: 0.5.0
 platform: ruby
 authors:
 - Philip Rehberger
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2026-04-11 00:00:00.000000000 Z
+date: 2026-05-01 00:00:00.000000000 Z
 dependencies: []
 description: Detect encoding from BOM and heuristics with confidence scores, convert
   between encodings, normalize to UTF-8, analyze byte distributions, and handle Windows