philiprehberger-encoding_kit 0.4.0 → 0.5.0 (RubyGems)

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (7)

checksums.yaml +4 -4
data/CHANGELOG.md +7 -0
data/README.md +22 -0
data/lib/philiprehberger/encoding_kit/converter.rb +14 -0
data/lib/philiprehberger/encoding_kit/version.rb +1 -1
data/lib/philiprehberger/encoding_kit.rb +25 -0
metadata +2 -2

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: 135376935f830bcfabb9c706853c5a45d14b270d47cab9d5e54f1ddcfae5f1d4
-  data.tar.gz: 9b06d60d1fd8c3cea4c5197ecb0e26ababe21be31abd6ca92a029978039bc34a
+  metadata.gz: 3e69668037179f14b92560b58c816a29a14134560bd32568775fd321325b4644
+  data.tar.gz: 3ad9997b697ca6cecca1d8e63955512eb860a86c48007a5138777167da9c4d13
 SHA512:
-  metadata.gz: 4e3f14286a3ee38a666a246bc513c972c7483b9b3f77f7ce9e2bfc7324d13eb5cd8a4dc1c3fd3b2d524219b5173679ccfcb5bc510f1f9cedf8eb1f865210b1aa
-  data.tar.gz: a71393c68877452b9005b69415b321ddc2287c57d87de20ec55abb6f57f242dcf64bebe0412e62d9b984371a5682c797c0c58614bcc629ba485570399abf496b
+  metadata.gz: af626ca49ad283a08574162ed45b81fabdccf7aca6d978d78d9367df150234e28d7cd563a1c212145c3a399a7b431b65cd1508fbba67a784517344b124a85a38
+  data.tar.gz: ffd8620298177f0a689411d49bdadbcc61c0a93c0fb0c5afc6cd9e5f72ef77f627ec14aea5819ca898f4aeb774cbdef02957c1777d640712cab360c7eb6622e6
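
Both digests change for both archive members because the packaged content changed. As a rough sketch of how entries like these can be recomputed locally, the snippet below hashes a byte string with Ruby's `digest` standard library. `digests_for` is a hypothetical helper name, and the sketch assumes the caller has already read `metadata.gz` or `data.tar.gz` out of the `.gem` archive into memory; it is not how RubyGems itself produces the file.

```ruby
require 'digest'

# Hypothetical helper (not part of RubyGems or this gem): compute the two
# hex digests that a checksums.yaml entry records for one archive member.
def digests_for(bytes)
  {
    'SHA256' => Digest::SHA256.hexdigest(bytes),
    'SHA512' => Digest::SHA512.hexdigest(bytes)
  }
end

# Demonstration on an empty byte string; in practice `bytes` would be the
# binary contents of metadata.gz or data.tar.gz.
sums = digests_for('')
puts sums['SHA256']
puts sums['SHA512'].length
```

Comparing the resulting hex strings against the `+` lines above would confirm that a downloaded 0.5.0 package matches the published one.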

data/CHANGELOG.md CHANGED

@@ -7,6 +7,13 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 ## [Unreleased]
+## [0.5.0] - 2026-04-30
+### Added
+- `EncodingKit.scrub(string)` — strip invalid bytes from a string (vs. `normalize` which replaces with `�`)
+- `EncodingKit.normalize_line_endings(string, to:)` — convert mixed CRLF/CR/LF to a single canonical form (`:lf`, `:crlf`, or `:cr`)
+- `Converter.scrub` companion method on the `Converter` module
 ## [0.4.0] - 2026-04-20
 ### Added

data/README.md CHANGED

@@ -157,6 +157,26 @@ Philiprehberger::EncodingKit.guess_from_filename("legacy.latin1.txt") # => Encod
 Philiprehberger::EncodingKit.guess_from_filename("report.csv")        # => nil
 ```
+### Stripping Invalid Bytes
+```ruby
+# normalize replaces invalid bytes with U+FFFD ('�')
+Philiprehberger::EncodingKit.normalize("foo\xFFbar")  # => "foo�bar"
+# scrub removes them entirely
+Philiprehberger::EncodingKit.scrub("foo\xFFbar")      # => "foobar"
+```
+### Normalizing Line Endings
+```ruby
+mixed = "alpha\r\nbeta\rgamma\ndelta"
+Philiprehberger::EncodingKit.normalize_line_endings(mixed)              # => "alpha\nbeta\ngamma\ndelta"
+Philiprehberger::EncodingKit.normalize_line_endings(mixed, to: :crlf)   # => "alpha\r\nbeta\r\ngamma\r\ndelta"
+Philiprehberger::EncodingKit.normalize_line_endings(mixed, to: :cr)     # => "alpha\rbeta\rgamma\rdelta"
+```
 ### Validity Check
 ```ruby
@@ -177,6 +197,8 @@ Philiprehberger::EncodingKit.valid?("hello", encoding: Encoding::US_ASCII)  # =>
 | `EncodingKit.transcode(string, to:, fallback:, replace:)` | Auto-detect source and convert to target encoding |
 | `EncodingKit.to_utf8(string, from: nil)` | Convert to UTF-8, auto-detect source if `from` is nil |
 | `EncodingKit.normalize(string)` | Force to valid UTF-8, replacing bad bytes with U+FFFD |
+| `EncodingKit.scrub(string)` | Force to valid UTF-8 by removing invalid bytes entirely |
+| `EncodingKit.normalize_line_endings(string, to: :lf)` | Convert mixed CRLF/CR/LF to a single canonical form (`:lf`, `:crlf`, `:cr`) |
 | `EncodingKit.valid?(string, encoding: nil)` | Check if string is valid in given or current encoding |
 | `EncodingKit.convert(string, from:, to:)` | Convert between arbitrary encodings |
 | `EncodingKit.strip_bom(string)` | Remove byte order mark if present |

data/lib/philiprehberger/encoding_kit/converter.rb CHANGED

@@ -53,6 +53,20 @@ module Philiprehberger
           str.encode(Encoding::UTF_8, str.encoding, invalid: :replace, undef: :replace, replace: "\uFFFD")
         end
+        # Strip invalid bytes from a string, returning valid UTF-8 with bad bytes removed.
+        #
+        # Unlike {.normalize}, which replaces invalid bytes with `\uFFFD`, this method
+        # removes them entirely, which is useful when downstream consumers cannot tolerate
+        # any non-source content.
+        #
+        # @param string [String] the input string
+        # @return [String] valid UTF-8 string with invalid bytes removed
+        def scrub(string)
+          str = string.dup
+          str.force_encoding(Encoding::UTF_8) if [Encoding::BINARY, Encoding::ASCII_8BIT].include?(str.encoding)
+          str.scrub('')
+        end
       end
     end
   end
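
The replace-versus-remove distinction in this hunk can be reproduced with core Ruby alone. The following is a minimal sketch rather than the gem's code: `replace_invalid` and `strip_invalid` are hypothetical helper names, and the sketch assumes binary-tagged input is meant to be read as UTF-8. It relies only on `String#b`, `String#force_encoding`, and `String#scrub`.

```ruby
# Hypothetical helpers illustrating the two strategies; not part of the gem.

# Replace each invalid byte sequence with U+FFFD.
def replace_invalid(string)
  str = string.dup
  # Binary-tagged strings are reinterpreted as UTF-8 before scrubbing.
  str.force_encoding(Encoding::UTF_8) if str.encoding == Encoding::BINARY
  str.scrub("\uFFFD")
end

# Remove invalid byte sequences entirely.
def strip_invalid(string)
  str = string.dup
  str.force_encoding(Encoding::UTF_8) if str.encoding == Encoding::BINARY
  str.scrub('')
end

raw = "foo\xFFbar".b # binary string; 0xFF is never valid in UTF-8
puts replace_invalid(raw).inspect
puts strip_invalid(raw).inspect
```

`Encoding::BINARY` and `Encoding::ASCII_8BIT` name the same encoding in Ruby, so a single comparison covers both entries of the array in the diff. `String#scrub` keeps the receiver's encoding, so in this sketch a string tagged with some other encoding (ISO-8859-1, for example) would be scrubbed in that encoding and not transcoded to UTF-8.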

data/lib/philiprehberger/encoding_kit/version.rb CHANGED

@@ -2,6 +2,6 @@
 module Philiprehberger
   module EncodingKit
-    VERSION = '0.4.0'
+    VERSION = '0.5.0'
   end
 end

data/lib/philiprehberger/encoding_kit.rb CHANGED

@@ -104,6 +104,31 @@ module Philiprehberger
       Converter.normalize(string)
     end
+    # Strip invalid bytes from a string, returning valid UTF-8.
+    #
+    # Unlike {.normalize}, which replaces invalid bytes with `�`, this method
+    # removes them entirely.
+    #
+    # @param string [String] the input string
+    # @return [String] valid UTF-8 string with invalid bytes removed
+    def self.scrub(string)
+      Converter.scrub(string)
+    end
+    LINE_ENDINGS = { lf: "\n", crlf: "\r\n", cr: "\r" }.freeze
+    # Normalize line endings to a single canonical form.
+    #
+    # @param string [String] the input string
+    # @param to [Symbol] target line ending: `:lf`, `:crlf`, or `:cr`
+    # @return [String] string with normalized line endings
+    # @raise [Error] if `to:` is not one of `:lf`, `:crlf`, or `:cr`
+    def self.normalize_line_endings(string, to: :lf)
+      target = LINE_ENDINGS[to] or raise Error, "Unknown line ending: #{to.inspect} (expected :lf, :crlf, or :cr)"
+      string.gsub(/\r\n|\r|\n/, target)
+    end
     # Check if a string is valid in the given encoding (or its current encoding).
     #
     # @param string [String] the input string
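
The line-ending conversion added in this hunk rests on a single `gsub` with an ordered alternation. The sketch below restates the technique as a standalone method so it can be run without the gem. `convert_line_endings` is a hypothetical name, and it raises `KeyError` through `Hash#fetch` where the gem raises its own `Error` class.

```ruby
# Hypothetical standalone version of the technique; not the gem's method.
def convert_line_endings(string, to: :lf)
  endings = { lf: "\n", crlf: "\r\n", cr: "\r" }
  # Hash#fetch raises KeyError for an unknown key (the gem raises Error).
  target = endings.fetch(to)
  # "\r\n" comes first in the alternation so a CRLF pair is matched as one
  # line ending instead of as a CR followed by a separate LF.
  string.gsub(/\r\n|\r|\n/, target)
end

mixed = "alpha\r\nbeta\rgamma\ndelta"
puts convert_line_endings(mixed).inspect
puts convert_line_endings(mixed, to: :crlf).inspect
puts convert_line_endings(mixed, to: :cr).inspect
```

Because the pattern consumes `\r\n` as a unit, running the conversion a second time with the same target leaves the string unchanged, including for `to: :crlf`.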

metadata CHANGED

@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: philiprehberger-encoding_kit
 version: !ruby/object:Gem::Version
-  version: 0.4.0
+  version: 0.5.0
 platform: ruby
 authors:
 - Philip Rehberger
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2026-04-20 00:00:00.000000000 Z
+date: 2026-05-01 00:00:00.000000000 Z
 dependencies: []
 description: Detect encoding from BOM and heuristics with confidence scores, convert
   between encodings, normalize to UTF-8, analyze byte distributions, and handle Windows